The Cache: Technology Expert's Forum
 
Author Topic: Coding for high traffic  (Read 5934 times)
deregular
Expert
Posts: 172
« on: September 01, 2007, 08:10:57 PM »

I'm about to take on a project, and it's looking as though it will be a very high-traffic site.
I'm just wondering if any of you guys have any advice on server load and/or programming for high-traffic applications?

Any tutorials or articles you can direct me to?

I want to make sure I have most bases covered before taking the job on.

As always, thanks in advance.
Logged
slaphappy
n00b
Posts: 7
« Reply #1 on: September 01, 2007, 09:38:46 PM »

"High traffic" is relative, but when I think high traffic, I'm thinking load balancing.  If the traffic is truly high, you'll need it.  You'll also need the high availability load balancing gives you, since lots of traffic usually means an important application.  It's hard to code around not having the right hardware for the job.  Your machines will need a lot of CPU speed and a lot of RAM.  Squid caching servers are great for offloading static content like images; nothing's faster than Squid for that.

The more static content and caching you can do, the better.  The less code you have to execute, the better.  You should look at using the Zend optimizer for your PHP stuff. 

Database design could be what makes or breaks you, though.  Assuming you are using MySQL, make sure you're using query caching.  Make sure you use indexes.  Don't write wasteful queries.  Work closely with the DBA.
Again, hardware is going to make a difference here.  You need fast 15K RPM SCSI drives with a good RAID controller that has a lot of cache on it, fast CPUs, and lots of RAM.
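To make the indexing advice concrete, here's a hedged sketch (the table and column names are made up for illustration):

```sql
-- Hypothetical products table: queries filter on category_id,
-- so give that column an index.
CREATE INDEX idx_products_category ON products (category_id);

-- EXPLAIN shows whether MySQL will actually use it:
-- "key: idx_products_category" is what you want to see;
-- "type: ALL" means a full table scan on every request.
EXPLAIN SELECT id, name FROM products WHERE category_id = 42;
```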

I didn't really give you much coding advice, but you don't just code for this kind of thing.  You have to think about the entire architecture.   At the end of the day, if you don't have all of the above covered, it really doesn't matter what you code.
Logged

No links in signatures please
deregular
Expert
Posts: 172
« Reply #2 on: September 01, 2007, 09:45:14 PM »

Thanks slap,
I kind of came to that conclusion myself. Load balancing will definitely be needed. It'll be similar to a job listing site, or rather a heavily promoted product promotion site, so I can see a lot of db calls and images as well.

Seems I've a little bit of research to do.

cheers
Logged
slaphappy
n00b
Posts: 7
« Reply #3 on: September 01, 2007, 09:48:07 PM »

Can I ask what database you will be using?
Logged

deregular
Expert
Posts: 172
« Reply #4 on: September 01, 2007, 10:35:58 PM »

It'll be PHP/MySQL.
I don't think it will be huge traffic to begin with, since it's local (Australia), but they are looking at huge promotion and have budgeted for commercials etc. in prime time. So I suppose it will spike a little.
So at tops I'm guessing perhaps 400,000+ hits a day. I know that they will be looking at going global eventually, but obviously that will be something I'll deal with when it arises.
Logged
slaphappy
n00b
Posts: 7
« Reply #5 on: September 02, 2007, 07:05:43 AM »

Capacity planning is really pretty difficult when dealing with a huge unknown like an unwritten application. 

You're definitely going to have to do a lot of testing, simulating a lot of concurrent users.  A big company might use something like Mercury Load Runner, but that's really expensive.

You can download the "iOpus Macros" browser plugin for free on a bunch of PCs and "play back" browser sessions in a loop.  That should give you some idea of how you will perform under load from real users.  ab (Apache Benchmark) will give you a measurement of total throughput for a single URL.  This is pretty good to use for a single page you know is going to be kind of heavy.  It can help you see if your tweaking is really making an improvement, and by how much.
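For example, a typical ab run against a known-heavy page might look like this (the URL is a placeholder; ab needs a live server to point at):

```shell
# 1000 total requests, 50 concurrent, against one heavy page.
# Watch "Requests per second" and "Time per request" in the output,
# then re-run after each tweak to see if it actually helped.
ab -n 1000 -c 50 "http://staging.example.com/search.php?q=widgets"
```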

Keep track of how much bandwidth you are using too.  Your web server will melt down if it fills the pipe.  What happens is it can't complete requests fast enough, and the number of connections to Apache starts climbing.  You start running out of memory and CPU, and it's a downward spiral from there.  I've seen this happen all too often under huge traffic spikes we weren't ready for.  Fortunately, modern load balancers have connection-limiting features.  You might not serve everybody, but at least you'll serve who you were prepared for.  They also have connection "concentrating", where the load balancer opens up, say, 5 connections to the webserver but sends 50 real connections over them.  This is huge.  It really reduces the memory Apache has to consume and lets the machine work more efficiently.  Compression from the load balancer helps with the bandwidth too, without consuming your web servers' CPU for mod_deflate.  Unfortunately, a high-end load balancer can be pretty expensive too.

Anyway, this is from a sysadmin's (sometimes coder's) perspective after years of dealing with high-volume traffic. I've pretty much seen it all, lol.  I've been out of that for about 1 1/2 years now.  These days I mainly work on giant DB2 databases and little WebLogic applications.
Logged

deregular
Expert
Posts: 172
« Reply #6 on: September 02, 2007, 08:07:05 AM »

Glad you're around, slap - this shit makes my head spin.
Awesome info, by the way.

I think, it being a local project, I will have at least some time to analyse and do some testing when it launches.

I'm thinking along the lines of separating the MySQL database and PHP routines onto different servers and seeing how I go from there. After doing some research today, there seem to be a few dedicated hosts that are willing to give 24/7 monitoring and, in an emergency, do what is needed to keep things running. And looking at the bigger picture, this one doesn't look like it will be the nightmare that I feared it would be, or that I have read about others having.

At this stage, money doesn't seem to be a problem for these guys, so I have at least a little room to move.

I have a couple of local servers here that I'll set up as temps to test as I'm building. Thanks for the tip on iOpus Macros; most likely I'll run this a little on the login and index pages. Perhaps I'm going overboard with worry, but I suppose being aware of the potential of such a project is the first step in being prepared for the worst.

As above, I'm seriously looking at 2 dedicated servers first up. One dedicated to MySQL, due to what I foresee as a lot of calls to the product database, especially from those just browsing around pulling info and running searches; the other for standard PHP routines, images, and as a mail server. Then I'll be running some tests to see if load-balancing solutions are called for. Obviously, because it will be a login/contribution system, round-robin DNS won't cut it; I'll have to keep sessions alive once users have made a call to a script.

Having never dealt with such traffic before, it's a learning curve for me. If I could, I'd love to call on you for some advice somewhere along the line once things get going.

Your advice is very much appreciated.

cheers mate

Any thoughts on caching MySQL queries?
Logged
slaphappy
n00b
Posts: 7
« Reply #7 on: September 02, 2007, 10:49:46 AM »

It's definitely a good idea to have MySQL and Apache on different machines.  Performance is the first reason.  Second, when it's time to add another web server, the architecture will be in place to drop it in and point it to the dedicated database server.  You'll want a private gigabit network between everything, so you'll probably end up with a partial rack somewhere to house everything.  Definitely cache the queries.  RAM is cheap, and it will speed things up.
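A hedged my.cnf sketch of the query caching we keep mentioning (the sizes are arbitrary starting points to tune, not recommendations; note the query cache was removed entirely in MySQL 8.0):

```ini
[mysqld]
query_cache_type  = 1     # cache SELECT results where possible
query_cache_size  = 64M   # arbitrary starting point; measure and tune
query_cache_limit = 1M    # skip caching huge result sets
```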

It's good that it sounds like you are having a "soft" launch.  There's testing, and then there's the real world.  You will see how things work for real when you get that going.

Good luck!

Logged

perkiset
Olde World Hacker
Administrator
Posts: 10096
« Reply #8 on: September 02, 2007, 05:31:52 PM »

Lots of good stuff in there, Slap - I'd like to weigh in with my personal experience as well.

Load Balancing - There are hardware load balancers, software ones and such, but I use two different methods which have served me well; they're free and strong *enough*. My most prevalent is to use Apache itself: I use lookup tables in combination with mod_rewrite to proxy requests inward to machines that do the actual HTML rendering. In my highest-traffic situations, graphics and videos are served from an entirely separate machine. Here's the structure of my highest-volume web apps:

* HTML requests come to a pretty good sized Sun box running tightened-down Solaris and Apache.
* Based on the request and the surfer type the request is either rewritten into a local PHP script that handles tiny stuff, or into PHP boxes that handle all SEO Spider requests, or into a larger cluster of machines that do rendering for real-surfer HTML.
* Each of these machines has 2 network cards: the other connection is to a gig-switched net that has a MySQL and memcached machine on it. So essentially, the "spine" of my rendering machines/database/memcache is quiet except for inter-machine comms.
* Surfer state is kept in memcached, obviously inventory, customer recs etc are kept in the database.
* I do not cache DB queries, I cache chunks of HTML - so rather than re-querying for a product gallery, I ask memcache for the latest version of the HTML of it. This HTML is modified whenever a change to inventory is made. I put the burden of updating the HTML on the inventory modifier, rather than the surfer. That saves a *lot* of time.
* The page that is created contains references to graphics that live on another machine entirely (graphics.mydomain.com rather than www.mydomain.com) - so the graphics and videos portion of the website is Squid-cached, fast as hell, and out of the way of the tunnel and processing of the HTML.
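A minimal PHP sketch of the HTML-chunk caching in the bullets above (the server address, key name, and render function are hypothetical; assumes the pecl memcache extension):

```php
<?php
// Memcached box on the quiet private network (address is hypothetical)
$mc = new Memcache();
$mc->connect('10.0.0.5', 11211);

// Ask memcache for the latest pre-rendered gallery HTML...
$html = $mc->get('gallery_html');
if ($html === false) {
    // ...rebuilding only on a miss. In the scheme above, the inventory
    // editor does this write when stock changes, not the surfer.
    $html = render_product_gallery();      // hypothetical renderer
    $mc->set('gallery_html', $html, 0, 0); // no expiry; replaced on edit
}
echo $html;
```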

The loadbalancing works by Apache randomizing which machine it will go to from a text-file list of available machines. Apache watches the mtime of this file - if it's changed, then on the next page load it is pulled up and cached again. This gives me the ability to use a little app that I call clusman (Cluster Manager) to reroute traffic any way I see necessary instantly. Converting a machine to staging server (rather than production), taking machines out for maintenance or dropping a machine entirely if it is out of commission is pretty effortless. Also, by changing the number of references to each machine in the clusman file, I can change the likelihood that <machine x> will get picked [n%] of the time - meaning that if one machine is considerably faster than another, I can give it a larger percentage of the traffic. Below is a modified example of an Apache config virtual host for one of these kinds of sites. In this example (chopped a bit and modified to protect the innocent) I have 4 machines rendering for production and 1 for staging. I am using the local machine for graphics requests, local machine for PHP spidersite work and proxying into the renderers.

Code:
<VirtualHost 1.2.3.4:80>
        ServerName              www.mydomain.com
        DocumentRoot            /www/htdocs/mydomain
        RewriteEngine           on
        RewriteMap              cluster  rnd:/www/resource/db/cluster_list
        RewriteMap              denied  dbm:/www/resource/db/denylist.map
        RewriteMap              botbase dbm:/www/resource/db/botbase.map
        RewriteMap              xlate   dbm:/www/resource/db/xlate_mydomain.map
        #RewriteLog             /www/resource/rewrite.log
        #RewriteLogLevel        10
        Options                 +FollowSymLinks

        # These are just the javascript or graphics aliases
        Alias /graphics /www/graphics/mydomain/webimages
        Alias /photos /www/graphics/mydomain/product
        Alias /js /www/htdocs/global

        # If it's a resource hit (graphics etc) then succeed and end rewriting...
        RewriteCond             %{REQUEST_URI}  /graphics       [OR]
        RewriteCond             %{REQUEST_URI}  /photos         [OR]
        RewriteCond             %{REQUEST_URI}  /js
        RewriteRule             ^(.*)$          -               [L]

        # It may be a browser that wants the favicon and won't read my redirect...
        RewriteCond             %{REQUEST_URI}  /favicon.ico
        RewriteRule             ^(.*)$           -               [L]

        # If found in the denied list, then stub the request out...
        RewriteCond             ${denied:%{REMOTE_ADDR}|0}      >0
        RewriteRule             ^(.*)$          /www/htdocs/deny.html   [L]

        # if it's a spider, reroute the request into the local spidersite php system...
        # Note that regardless what input parameters come in, I rewrite them to
        # nothing but the searching engine. Spider URLS do not contain parameters.
        RewriteCond             ${botbase:%{REMOTE_ADDR}|0} >0
        RewriteRule             ^(.*)$          /php/spidersite/main.php$1?engine=${botbase:%{REMOTE_ADDR}}      [L]

        # if the stub file is up, then I don't want any users coming back to the framework...
        RewriteCond             /www/resource/stub      -f
        RewriteRule             ^(.*)$  /stub.html  [L]

        # It could be a surfer with a spider URL - if so, translate to the correct landing zone...
        RewriteCond             ${xlate:%{REQUEST_URI}|,} >,
        RewriteRule ^(.*)$ http://${cluster:online}/${xlate:%{REQUEST_URI}}?__site_id=SBTD1.0002&origip=%{REMOTE_ADDR}&port=%{SERVER_PORT} [P,L,QSA]

        # Finally - it's just a normal (surfer) request - proxy it on...
        RewriteRule ^(.*)$ http://${cluster:online}$1?__site_id=SBTD1.0002&origip=%{REMOTE_ADDR}&port=%{SERVER_PORT} [P,L,QSA]
</VirtualHost>

The cluster file for this example looks like this:
Code:
online     rcluster_01|rcluster_02|rcluster_03|rcluster_04
stage      rcluster_05

The other way I do some load balancing is with IPTables in IPCop (a firewall in front of another set of renderers). By specifying how many packets I want to go to which address, I get the effect of load balancing... but it is less intelligent, configurable and practical than the Apache method.
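For illustration, stock iptables can approximate this packet-counting approach with the `statistic` match (backend addresses are hypothetical; IPCop's own configuration differs):

```shell
# Round-robin new HTTP connections across two renderers.
# Rule order matters: every 2nd new connection is DNATed to .11,
# the remainder fall through to .12.
iptables -t nat -A PREROUTING -p tcp --dport 80 -m state --state NEW \
    -m statistic --mode nth --every 2 --packet 0 \
    -j DNAT --to-destination 10.0.0.11:80
iptables -t nat -A PREROUTING -p tcp --dport 80 -m state --state NEW \
    -j DNAT --to-destination 10.0.0.12:80
```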

Machine and inter-machine speed: I keep machine throughput up by using APC, both to keep the PHP compiled and for user cache items. As I mentioned above, the "private network" that exists between my renderers and my database/memcache is quiet except for this traffic.

From a programming perspective: IMO, the more you can perceive webpages to be objects that *appear* to be one big website/application (because they look the same) but are actually tiny little self-standing apps that do precisely what they need to do and little more, the more you will be able to rely on the natural chaos of surfer requests to balance the processing load on any one machine. What I mean is that if you have a huge monolithic app that controls loads of sites and configs and all kinds of elegant but heavyweight stuff, then every page pull will burden the server with that much work... but if this surfer is calling a little cached page and that one is calling a gallery, and so on, then the load will be lighter and quicker. This may be one of the primary reasons that, after a considerable amount of research, I passed on an EJB methodology for my new systems and went with a PHP/JS/Ajax-like methodology.

From a protection standpoint: Offer NO PORTS on your machine other than 80 (and 443 if required). If you can, front-end your systems with a firewall like IPCop that does not respond to anything except port 80. This will keep script kiddies from getting too excited about your box. If you respond at all on SMTP, POP, FTP, SFTP, SSH (you get the idea), you will get hammered by cracker bots trying to get in. Your machine response times will suffer (even though the bots don't get in) because it's sort of like a mini-DOS attack. Track IPs and look for cookies - if a rogue or amateur bot maker is hitting your machines regularly, keep a list of IPs that you automatically ban or route out into space. If this list is also in a lookup table for Apache (mine is), then Apache can get rid of the interloper before the request ever reaches your scripts. Keep your bandwidth for your real users and spiders.

Interesting story: I tried to bring Samba up on the internal side of a non-firewalled Solaris box once. Samba only appeared on the internal NIC, but since there was now a service running on that port, an interesting change happened in the machine: where a ping or request to that port had previously been answered with a simple "request denied" by the box, now the request hung out there in space - it never went anywhere, but it was clear there was something different about that port on the external NIC. About an hour after I had Samba running, I was receiving 1000s of hits from eastern European bots that were trying everything they could against that port. It was insane. So again, offer nothing except 80 (and 443) publicly and you'll be much better off.

Last but certainly not least - I am using more and more JavaScript and Ajax-like mechanisms to push processing out to the client and keep my server-side processing down. The ability to spread processing around to the people viewing you should not be underestimated - it can make a profound impact on your throughput. Things like:
* If a surfer hasn't yet viewed a particular place in a product gallery, then there's no need to have downloaded the images for it yet - by using JS to download only what I *must* have client side, I reduce traffic congestion and processing time on my side.
* Don't throw a form or ajax request up to the servers until you have completely verified that it is valid client side
* Use ajax-like mechanisms to pull down only what you *must* pull down to give the user what they want - don't pull the trigger on a whole page reload if a little change of data will suffice.
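A small sketch of the "verify client side before submitting" point (the field rules here are hypothetical examples):

```javascript
// Validate a form client-side; only fire the request when this
// returns no errors. Rules below are hypothetical examples.
function validateSignup(fields) {
  const errors = [];
  if (!fields.email || !/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(fields.email)) {
    errors.push('email');
  }
  if (!fields.username || fields.username.length < 3) {
    errors.push('username');
  }
  return errors; // empty array: safe to submit to the server
}
```

Wired to a form's onsubmit (or the ajax call), this stops a bad request before it ever costs the server a connection; the server still re-validates, of course.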

There's more, but looking at this list so far you'll probably run screaming. Sorry 'bout that... Wink

/p
« Last Edit: September 02, 2007, 05:34:55 PM by perkiset » Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
thedarkness
Lifer
Posts: 585
« Reply #9 on: September 03, 2007, 03:23:57 AM »

Man, that's tight, perk.

Dereg, you're in the right place, dude.

Cheers,
td
Logged

"I want to be the guy my dog thinks I am."
 - Unknown
deregular
Expert
Posts: 172
« Reply #10 on: September 03, 2007, 04:01:07 AM »

I reckon, TD. And nah, I won't run screaming, perk. I find it a challenge.
A problem 'always' has a solution... well, that's my school of thought anyway.

Thanks for all the advice guys, absolutely brilliant insights.

I'll keep ya posted on how I go.

cheers
Logged
mrsdf
Rookie
Posts: 20
« Reply #11 on: September 03, 2007, 11:35:09 AM »


Quote from: perkiset on September 02, 2007, 05:31:52 PM
"From a protection standpoint: Offer NO PORTS on your machine other than 80 (and 443 if required). [...] So again, offer nothing except 80 (and 443) publicly and you'll be much better off."

paranoid ++ : If you ever need remote shell access to the system, I'd use a port-knocking mechanism to keep it hidden (some people would argue against port knocking as 'security by obscurity', but I just consider it an additional layer of security/hiding things). Basically, what this does is let you run an ssh daemon on a custom port that only becomes visible after the connecting client sends a very specific sequence of packets to the server. The ssh daemon will still require authentication (use key authentication in ssh), but it will be absolutely invisible to any user/bot scanning the machine unless he/it knows the sequence of packets to send. It's all done at the firewall level, with no visible service running. There's more info out there on the web, and plenty of arguments both for and against using this.
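A rough sketch of the idea with iptables' `recent` module (ports and timings are arbitrary examples; a real setup usually uses a dedicated daemon like knockd, and needs extra rules to expire stale entries):

```shell
# Knock port 7000, then 8000; sshd on 2222 then accepts your IP
# for 30 seconds. All three ports are arbitrary examples.
iptables -A INPUT -p tcp --dport 7000 -m recent --name KNOCK1 --set -j DROP
iptables -A INPUT -p tcp --dport 8000 -m recent --name KNOCK1 --rcheck \
    -m recent --name KNOCK2 --set -j DROP
iptables -A INPUT -p tcp --dport 2222 -m recent --name KNOCK2 --rcheck \
    --seconds 30 -j ACCEPT
iptables -A INPUT -p tcp --dport 2222 -j DROP
```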
Logged

We're sp4mmin', we're sp4mmin', I hope you like sp4mmin' too...
perkiset
Olde World Hacker
Administrator
Posts: 10096
« Reply #12 on: September 03, 2007, 11:42:57 AM »

Actually, that's a really nice tip for those that can't put a firewall box in front of their service boxes, man... IMO every obstacle you can throw up against attackers helps - it'll just make the guy right next to you an easier target, and since there are plenty of them, you may get away more often.

I have VPNs into my stuff so that all ports are open for anything I need... behind the wall. I use OpenVPN from the road and a dedicated firewall/VPN net-to-net solution (IPCop) from my desk. Works great and I can sleep at night Wink
Logged
