deregular

I'm about to take on a project, and it's looking as though it will be a very high traffic site.
I'm just wondering if any of you guys have any advice on server load and/or programming for high traffic applications?

Any tutes or articles you can direct me to?

I want to make sure I have most bases covered before taking the job on.

As always, thanks in advance.

slaphappy

"High traffic" is relative, but when I think high traffic, I'm thinking load balancing.  If the traffic is truly high, you'll need it.  You'll also need the high availability load balancing gives you since lots of traffic usually means an important application.  It's hard to code around not having the right hardware for the job.  Your

mac

 hines will need a lot of CPU speed and a lot of RAM.  Squid caching servers are great for offloading static stuff like images.  Nothing's faster than Squid for that kind of stuff. 

The more static content and caching you can do, the better.  The less code you have to execute, the better.  You should look at using the Zend optimizer for your

PHP

  stuff. 

Database design could be what makes or breaks you though.  Assuming you are using MySQL, make sure you're using query caching.  Make sure you use indexes.  <>Don't write stupid queries.  Work closely with the DBA.
Again, hardware is going to make a difference here.  You need fast 15KRPM SCSI drives with a good RAID controller that's got a lot of cache on it, Fast CPUs and lots of RAM. 
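
To make the index point concrete, here's a rough sketch (host, credentials, table and column names are all made up for illustration) of checking that a hot query actually uses an index from PHP.  The query cache itself is a server-side setting (query_cache_type / query_cache_size in my.cnf), not something you switch on from PHP.

<?php
// Hypothetical sketch - connection details, table and column names are made up.
$db = new mysqli('dbhost', 'user', 'pass', 'shop');

// One-off: index the column the hot query filters on.
$db->query('ALTER TABLE products ADD INDEX idx_category (category_id)');

// Then verify the plan actually uses it (look for key = idx_category, not a full table scan).
$res = $db->query('EXPLAIN SELECT id, name, price FROM products WHERE category_id = 42');
while ($row = $res->fetch_assoc()) {
    print_r($row);
}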

I didn't really give you much coding advice, but you don't just code for this kind of thing.  You have to think about the entire architecture.  At the end of the day, if you don't have all of the above covered, it really doesn't matter what you code.

deregular

Thanks slap,
I kind of came to that conclusion myself. Load balancing will definitely be needed. It'll be similar to a job listing site, or rather a heavily promoted product promotion site, so I can see a lot of DB calls and images as well.

Seems I've a little bit of research to do.

cheers

slaphappy

Can I ask what database you will be using?

deregular

It'll be PHP/MySQL.
I don't think it will be huge traffic to begin with since it's local (Australia), but they are looking at huge promotion and have budgeted for commercials etc. in prime time, so I suppose it will spike a little.
At tops I'm guessing perhaps 400,000+ hits a day. I know that they will be looking at going global eventually, but obviously that will be something I'll deal with when it arises.

slaphappy

Capacity planning is really pretty difficult when dealing with a huge unknown like an unwritten application.

You're definitely going to have to do a lot of testing, simulating a lot of concurrent users.  A big company might use something like Mercury LoadRunner, but that's really expensive.

You can download the "iOpus Macros" browser plugin for free on a bunch of PCs and "play back" browser sessions in a loop.  That should give you some idea of how you will perform under load from real users.  ab (Apache Benchmark) will give you a measurement of total throughput from a single URL.  This is pretty good to use for a single page you know is going to be kind of heavy.  It can help you see if your tweaking is really making an improvement, and by how much.
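
For example, something along these lines (the URL and numbers are made up) hammers one heavy page with 100 concurrent clients:

ab -n 5000 -c 100 "http://www.example.com/search.php?q=widgets"

-n is the total number of requests and -c the concurrency; watch the requests-per-second figure and the slowest request times as you tune.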

Keep track of how much bandwidth you are using too.  Your web servers will melt down if you fill the pipe.  What happens is they can't complete the requests fast enough, and the number of connections to Apache starts climbing.  You start running out of memory and CPU, and it's a downward spiral from there.  I've seen this happen all too often under huge traffic spikes we weren't ready for.  Fortunately, modern load balancers have connection limiting features.  You might not serve everybody, but at least you'll serve who you were prepared for.  They also have connection "concentrating", where the load balancer opens up, say, 5 connections to the webserver but sends 50 real connections over them.  This is huge.  It really reduces the memory Apache has to consume and lets the machine work more efficiently.  Compression at the load balancer helps with the bandwidth too, without consuming your web servers' CPU for mod_deflate.  Unfortunately, a high end load balancer can be pretty expensive too.

Anyway, this is from a sysadmin's (sometimes coder) perspective after years of dealing with high volume traffic. I've pretty much seen it all, lol.  I've been out of that for about 1 1/2 years now.  These days I mainly work on giant DB2 databases and little WebLogic applications.

deregular

Glad you're around slap, this shit makes my head spin.
Awesome info by the way.

I think with it being a local project, I will have at least some time to analyse and do some testing when it launches.

I'm thinking along the lines of separating the MySQL database and PHP routines onto different servers and seeing how I go from there. After doing some research today, there seem to be a few dedicated hosts that are willing to give 24/7 monitoring and, in an emergency, do what is needed to keep things running. And looking at the bigger picture, this one doesn't look like it will be the nightmare I thought it would be, or that I have read about others having.

At this stage, money doesn't seem to be a problem for these guys, so I have at least a little room to move.

I have a couple of local servers here that I'll set up as temporary test boxes as I'm building. Thanks for the tip on iOpus Macros; most likely I'll run it a little on the login and index pages. Perhaps I'm going overboard with worry, but I suppose being aware of the potential of such a project is the first step in being prepared for the worst.

As above, I'm seriously looking at 2 dedicated servers first up: one dedicated to MySQL, due to what I foresee as a lot of calls to the product database, especially from people just browsing around pulling info and running searches; the other for the standard PHP routines, images and mail. Then I'll run some tests to see if load balancing solutions are called for. Obviously, because it will be a login/contribution system, DNS round-robin won't cut it; I'll have to keep sessions alive once users have made a call to a script.

Having never dealt with such traffic before, it's a learning curve for me. If I could, I'd love to call on you for some advice somewhere along the line once things get going.

Your advice is very much appreciated.

cheers mate

Any thoughts on cached MySQL queries?

slaphappy

It's definitely a good idea to have MySQL and Apache on different machines, performance being the first reason.  Second, when it's time to add another web server, the architecture will be in place to drop it in and point it at the dedicated database server.  You'll want a private gigabit network between everything, so you'll probably end up with a partial rack somewhere to house it all.  Definitely cache the queries: RAM is cheap, and it will speed things up.

It's good that it sounds like you are having a "soft" launch.  There's testing, and then there's the real world.  You will see how things work for real when you get that going.

Good luck!

perkiset

Lots of good stuff in there, Slap - I'd like to weigh in with my personal experience as well.

Load balancing - There are hardware load balancers, software ones and such, but I use two different methods which have served me well, are free, and are strong *enough*. My most prevalent is to use Apache itself: I use lookup tables in combination with mod_rewrite to proxy requests inward to machines that do the actual HTML rendering. In my highest traffic situations, graphics and videos are served from an entirely separate machine. Here's the structure of my highest volume web apps:

* HTML requests come to a pretty good sized Sun box running tightened-down Solaris and Apache.
* Based on the request and the surfer type, the request is either rewritten into a local PHP script that handles tiny stuff, or into PHP boxes that handle all SEO spider requests, or into a larger cluster of machines that do rendering for real-surfer HTML.
* Each of these machines has 2 network cards: the second connection goes to a gig-switched net that has a MySQL machine and a memcached machine on it. So essentially, the "spine" of my rendering machines/database/memcache is quiet except for inter-machine comms.
* Surfer state is kept in memcached; obviously inventory, customer recs etc. are kept in the database.
* I do not cache DB queries, I cache chunks of HTML - so rather than re-querying for a product gallery, I ask memcache for the latest version of the HTML of it. This HTML is rebuilt whenever a change to inventory is made. I put the burden of updating the HTML on the inventory modifier, rather than the surfer. That saves a *lot* of time. (There's a sketch of this idea just after this list.)
* The page that is created contains references to graphics that live on another machine entirely (graphics.mydomain.com rather than www.mydomain.com) - so the graphics and videos portion of the website is Squid cached, fast as hell, and out of the way of the tunnel and processing of the HTML.
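
A minimal sketch of that HTML-chunk caching idea, using PHP's Memcache extension (the host, key names and the render/inventory functions are made up for illustration):

<?php
// Hypothetical sketch - render_gallery_html() and the key naming are illustrative only.
$mc = new Memcache();
$mc->connect('10.0.0.5', 11211);             // memcached box on the private net

function gallery_chunk($mc, $productId) {
    $key  = "gallery_html_$productId";
    $html = $mc->get($key);
    if ($html === false) {                        // not cached yet (or flushed)
        $html = render_gallery_html($productId);  // hits MySQL, builds the markup
        $mc->set($key, $html, 0, 0);              // 0 = no compression, 0 = no expiry
    }
    return $html;
}

// The inventory admin tool, not the surfer, pays for the rebuild:
function inventory_changed($mc, $productId) {
    $mc->set("gallery_html_$productId", render_gallery_html($productId), 0, 0);
}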

The load balancing works by Apache randomizing which machine it will go to from a text-file list of available machines. Apache watches the mtime of this file - if it's changed, then on the next page load it is pulled up and cached again. This gives me the ability to use a little app that I call clusman (Cluster Manager) to reroute traffic any way I see necessary, instantly. Converting a machine to a staging server (rather than production), taking machines out for maintenance, or dropping a machine entirely if it is out of commission is pretty effortless. Also, by changing the number of references to each machine in the clusman file, I can change the likelihood that <machine x> will get picked [n%] of the time - meaning that if one machine is considerably faster than another, I can give it a larger percentage of the traffic. Below is a modified example of an Apache config virtual host for one of these kinds of sites. In this example (chopped a bit and modified to protect the innocent) I have 4 machines rendering for production and 1 for staging. I am using the local machine for graphics requests, the local machine for PHP spidersite work, and proxying into the renderers.


<VirtualHost 1.2.3.4:80>
        ServerName              www.mydomain.com
        DocumentRoot            /www/htdocs/mydomain
        RewriteEngine           on
        RewriteMap              cluster  rnd:/www/resource/db/cluster_list
        RewriteMap              denied  dbm:/www/resource/db/denylist.map
        RewriteMap              botbase dbm:/www/resource/db/botbase.map
        RewriteMap              xlate   dbm:/www/resource/db/xlate_mydomain.map
        #RewriteLog             /www/resource/rewrite.log
        #RewriteLogLevel        10
        Options                 +FollowSymLinks

        # These are just the javascript or graphics aliases
        Alias /graphics /www/graphics/mydomain/webimages
        Alias /photos /www/graphics/mydomain/product
        Alias /js /www/htdocs/global

        # If it's a resource hit (graphics etc) then succeed and end rewriting...
        RewriteCond             %{REQUEST_URI}  /graphics       [OR]
        RewriteCond             %{REQUEST_URI}  /photos         [OR]
        RewriteCond             %{REQUEST_URI}  /js
        RewriteRule             ^(.*)$          -               [L]

        # It may be a browser that wants the favicon and won't read my redirect...
        RewriteCond             %{REQUEST_URI}  /favicon.ico
        RewriteRule             ^(.*)$          -               [L]

        # If found in the denied list, then stub the request out...
        RewriteCond             ${denied:%{REMOTE_ADDR}|0}      >0
        RewriteRule             ^(.*)$          /www/htdocs/deny.html   [L]

        # if it's a spider, reroute the request into the local spidersite PHP system...
        # Note that regardless what input parameters come in, I rewrite them to
        # nothing but the searching engine. Spider URLs do not contain parameters.
        RewriteCond             ${botbase:%{REMOTE_ADDR}|0} >0
        RewriteRule             ^(.*)$          /php/spidersite/main.php$1?engine=${botbase:%{REMOTE_ADDR}}      [L]

        # if the stub file is up, then I don't want any users coming back to the framework...
        RewriteCond             /www/resource/stub      -f
        RewriteRule             ^(.*)$  /stub.html  [L]

        # It could be a surfer with a spider URL - if so, translate to the correct landing zone...
        RewriteCond             ${xlate:%{REQUEST_URI}|,} >,
        RewriteRule ^(.*)$ http://${cluster:online}/${xlate:%{REQUEST_URI}}?__site_id=SBTD1.0002&origip=%{REMOTE_ADDR}&port=%{SERVER_PORT} [P,L,QSA]

        # Finally - it's just a normal (surfer) request - proxy it on...
        RewriteRule ^(.*)$ http://${cluster:online}$1?__site_id=SBTD1.0002&origip=%{REMOTE_ADDR}&port=%{SERVER_PORT} [P,L,QSA]
</VirtualHost>


The cluster file for this example looks like this:

online     rcluster_01|rcluster_02|rcluster_03|rcluster_04
stage      rcluster_05


The other way I do some load balancing is with IPTables in IPCop (the firewall in front of another set of renderers). By specifying how many packets I want to go to which address, I get the effect of load balancing... but it is less intelligent, configurable and practical than the Apache method.

Machine and inter-machine speed:
I keep machine throughput up by using APC, both for keeping the PHP compiled (the opcode cache) and for user cache items. As I mentioned above, the "private network" that exists between my renderers and my database/memcache is quiet except for this traffic.

From a programming perspective:
IMO, the more you can perceive webpages to be objects that *appear* to be one big website/application (because they look the same) but are actually tiny little self-standing apps that do precisely what they need to do and little more, the more you will be able to rely on the natural chaos of surfer requests to balance the processing load on any one machine. What I mean by this is that if you have a huge monolithic app that controls loads of sites and configs and all kinds of elegant but heavyweight stuff, then every page pull will burden the server with *that* much work... but if this surfer is calling a little cached page and that one is calling a gallery and so on, then the load will be lighter and quicker. This may be one of the primary reasons that, after a considerable amount of research, I passed on an EJB methodology for my new systems and went with a PHP/JS/Ajax-like methodology.

From a protection standpoint: Offer NO PORTS on your machine other than 80 (and 443 if required). If you can, front end your systems with a firewall like IPCop that does not respond to anything except port 80. This will keep script kiddies from getting too excited about your box. If you respond at all to SMTP, POP, FTP, SFTP, SSH (you get the idea) you will get hammered by cracker bots trying to get in. Your machine response times will suffer (even though the bots don't get in) because it's sort of like a mini-DoS attack. Track IPs and look for cookies - if a rogue or amateur bot maker is hitting your machines regularly, have a list of IPs that you automatically ban or route out into space. If this list is also in a lookup table for Apache (mine is), then Apache can get rid of the interloper before you even get to your scripts. Keep your bandwidth for your real users and spiders.

Interesting story: I tried to bring SAMBA up on the internal side of a non-firewalled Solaris box once. Now the SAMBA only appeared on the internal NIC, but since there was a service running on that port, an interesting change happened on the machine: where a ping or request to a certain port was previously returned with a simple "request denied" by the box, now the request hung out there in space - it never went anywhere, but it was clear there was something different about that port on the external NIC. About an hour after I had SAMBA running I was receiving 1000s of hits from eastern European bots that were trying everything they could against that port. It was insane. So again, offer nothing except 80 (and 443) publicly and you'll be much better off.

Last but certainly not least - I am using more and more javascript and ajax-like mechanisms to push processing out to the client and keep my processing down. The ability to spread the processing around to the people that are viewing you should not be underestimated - it can make a profound impact on your throughput. Things like:
* If a surfer hasn't yet viewed a particular place in a product gallery, then there's no need to have downloaded the images for it yet - by using JS as a way to only download what I *must* have client side, I reduce traffic congestion and processing time on my side.
* Don't throw a form or ajax request up to the servers until you have completely verified that it is valid client side.
* Use ajax-like mechanisms to pull down only what you *must* pull down to give the user what they want - don't pull the trigger on a whole page reload if a little change of data will suffice. (There's a small server-side sketch of this just after the list.)
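
On the server side of that last point, the ajax endpoints can be tiny scripts that return only the fragment being asked for, not a whole page. A minimal sketch (the script name, parameters and helper function are made up for illustration):

<?php
// gallery_fragment.php - hypothetical ajax endpoint returning one HTML chunk.
// Called from the page as e.g. /gallery_fragment.php?product=42&page=2
$productId = (int) $_GET['product'];
$page      = (int) $_GET['page'];

header('Content-Type: text/html; charset=utf-8');

// render_gallery_page() is an assumed helper that builds just this slice of
// the gallery (ideally pulling the HTML from memcache as described above).
echo render_gallery_page($productId, $page);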

There's more, but looking at this list so far you'll probably run screaming. Sorry 'bout that...

/p

thedarkness

Man that's tight perk.

Dereg, you in the right place dude.

Cheers,
td

deregular

I reckon, TD. And nah, I won't run screaming, perk. I find it a challenge.
A problem 'always' has a solution... well, that's my school of thought anyway.

Thanks for all the advice guys, absolutely brilliant insights.

I'll keep ya posted on how I go.

cheers

mrsdf

Quote from: perkiset
"From a protection standpoint: Offer NO PORTS on your machine other than 80 (and 443 if required). If you can, front end your systems with a firewall like IPCop that does not respond to anything except port 80. [...] So again, offer nothing except 80 (and 443) publicly and you'll be much better off."

paranoid++: If you ever need remote shell access to the system, I'd use a port knocking mechanism to keep it hidden (some people would argue against port knocking as 'security by obscurity', but I just consider it an additional layer of security/hiding things). Basically what this does is let you run an ssh daemon on a custom port that only becomes visible after the client that wants to connect sends a very specific sequence of packets to the server. The ssh daemon will still require authentication (use key authentication in ssh), but it will be absolutely invisible to any user/bot scanning the machine unless he/it knows the sequence of packets to send. It's all done at the firewall level, with no visible service running. There's more info out there on the web, and plenty of arguments for and against using this.

perkiset

Actually, that's a really nice tip for those that can't put a firewall box in front of their service boxes, man... IMO every obstacle you can throw up against attackers is a help - it just makes the guy right next to you an easier target, and since there are plenty of them, you're that much more likely to be left alone.

I have VPNs into my stuff so that all ports are open for anything I need... behind the wall. I use OpenVPN from the road and a dedi firewall/VPN net-to-net solution (IPCop) from my desk. Works great and I can sleep at night.

