Indica

let me jump into my flame suit and come out saying i'm a .net addict, using it for most of my 'beefy' systems and applications, while using php for web work. i'm rebuilding my entire seo system and would like to keep things in one language.

that said, i'm considering making the jump and going 100% php for the new system. one thing holding me back is my idea of php performance: can it compete with a multithreaded application? i'm trying to wrap my head around how to get the same performance in php as i do with my .net applications. i suppose you could execute the php script multiple times and achieve a multithreaded-like result? what do you do to gain the best performance?

also things like zend optimizer, what's your take on it: worth using, noticeable benefits?

i can already envision a sleek OO design for this thang, if i can get the performance issues squared away. thanks in advance, php gurus

nutballs

ah multithreading. perk will have a lot to say about it. you have to think a bit outside your normal client-app line of thinking. I use it a lot now, so does perk. I will elaborate if i get back before perk answers, but right now I gotta run out and beg for work... lol

Indica

yeah i know perk and multithreading are the epitome of a loving relationship

looking forward to your response

perkiset

Performance issues...

OK, question: is it that you need lots of crunching, where you have C++ pounding the processor, or is it just that you need lots of processes running concurrently against others' websites and such?

(Important note: I am/have been a strong C++ and object pascal programmer and have written many, many multithreaded apps and daemons. I am no stranger to this discussion from all sides...)

The reason is that there is a myth that multithreading really helps. The largest bottleneck in traditional setups (which I'd assume yours to be, particularly since you say "SEO") is network access and, in fact, the target site rather than your local processing.

Consider my personal spider: it is two scripts, a dispatcher and a spiderlet. I make heavy use of MySQL as well. The only manual thing that I need to do is put the starting URL into MySQL. The dispatcher, which runs either as often as I want or manually, looks and sees that there is a job to do - grabs the URL from a table and then dispatches a spiderlet to grab it. Spiderlets have one job only: grab the URL they are assigned and put any new links they find into the todo stack in the database (they do other stuff as well, but essentially work a single target URL).

When the spiderlet is done, it dies. The dispatcher is configured to watch ps aux to see how many spiderlets are currently running, and keep (n) spiderlets running at any given time. In this way, I am using the OS for multiprocessing, instead of compiling something complicated as a threaded app.
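A minimal sketch of that dispatcher, assuming a MySQL table named todo and a worker script named spiderlet.php (all names are illustrative, not perk's actual code):

<?php
// dispatcher.php - minimal sketch of the watcher idea; not perk's
// actual code. Assumes a MySQL table todo(id, url, status) and a
// worker script spiderlet.php; all names are illustrative.
$maxSpiderlets = 5; // keep (n) spiderlets running at any given time

// Ask the OS how many spiderlets are already alive.
$buff = shell_exec('ps aux');
$running = preg_match_all('/spiderlet\.php/', $buff, $m);

$db = new mysqli('localhost', 'user', 'pass', 'spider');
while ($running < $maxSpiderlets) {
    $row = $db->query("SELECT id, url FROM todo WHERE status='pending' LIMIT 1")->fetch_assoc();
    if (!$row) break; // nothing left on the todo stack
    $db->query("UPDATE todo SET status='working' WHERE id=" . (int)$row['id']);

    // Background the spiderlet; the OS does the "multithreading" for us.
    shell_exec('php spiderlet.php ' . escapeshellarg($row['url']) . ' > /dev/null 2>&1 &');
    $running++;
}

Run it from cron as often as you like; each pass just tops the pool back up to (n).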

Note here that I define "speed" also as how long it takes me to develop and maintain my warez. This little spider rig took me about 4 hours to go from concept to production, and it is still in production today. The same methodology works for my email blaster as well as ... erm ... well, other stuff.

It is true that head-to-head, a pure C++/C# app will kick ass on a PHP app. But the question is, at what cost? Now you have two languages, you are bound to a compiled model, you cannot deploy it easily on lots and lots of other machines (note that my spider et al can run on cheap cheap hosts, effectively making my scalability limitless...) ... the complexity and maintenance factor, combined with the increased difficulty in scaling, make compiled frameworks a non-starter for me anymore.

Note that with proper MySQL usage and watching the process list, I have eliminated the need for semaphores and mutexes, file locking, you name it - my systems are now so simple that most folks in love with complexity would laugh at me as a rookie. Fat lot they know...

Indica

you've hit many of the issues there perk

Quote from perkiset:
> OK, question: is it that you need lots of crunching, where you have C++ pounding the processor, or is it just that you need lots of processes running concurrently against others' websites and such?


mostly the latter, for scraping and other similar activities. i think i may still use .net (via mono) for processor-heavy tasks like page generation, keyword generation, stuff like that.

Quote from perkiset:
> When the spiderlet is done, it dies. The dispatcher is configured to watch ps aux to see how many spiderlets are currently running, and keep (n) spiderlets running at any given time. In this way, I am using the OS for multiprocessing, instead of compiling something complicated as a threaded app.


this is how i envisioned doing it (after seeing you describe this setup here). are you able to watch ps aux (whatever that is, i've got to brush up on *nix, i feel like a fish out of water in it) from shared hosts and such? if not, i think i can code my scripts in such a way that they run until there is no more work to be done. so using your spider example, each script would continue to run until the queue of urls is empty, thus eliminating the need for a watcher. this way i'd launch n amount of scripts and they'd run until the work's done. what do you think about this versus your method? what pros and cons of such a method can you think of, in comparison to yours?
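For what it's worth, a sketch of that watcher-less worker might look like this (same hypothetical todo table as the dispatcher sketch above; the claim-then-select dance is what lets n parallel copies share one queue without stepping on each other):

<?php
// worker.php - sketch of the "run until the queue is empty" idea.
// Assumes a MySQL table todo(id, url, status, worker); illustrative names.
$db = new mysqli('localhost', 'user', 'pass', 'spider');

while (true) {
    // Claim one pending URL atomically so n parallel copies don't collide.
    $db->query("UPDATE todo SET status='working', worker=CONNECTION_ID()
                WHERE status='pending' LIMIT 1");
    if ($db->affected_rows < 1) break; // queue empty: die quietly

    $row = $db->query("SELECT id, url FROM todo
                       WHERE status='working' AND worker=CONNECTION_ID()
                       LIMIT 1")->fetch_assoc();

    $html = file_get_contents($row['url']); // grab the page
    // ... parse $html and INSERT any new links as status='pending' ...

    $db->query("UPDATE todo SET status='done' WHERE id=" . (int)$row['id']);
}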

Quote from perkiset:
> It is true that head-to-head, a pure C++/C# app will kick ass on a PHP app. But the question is, at what cost? Now you have two languages, you are bound to a compiled model, you cannot deploy it easily on lots and lots of other machines (note that my spider et al can run on cheap cheap hosts, effectively making my scalability limitless...) ... the complexity and maintenance factor, combined with the increased difficulty in scaling, make compiled frameworks a non-starter for me anymore.


this is *the* reason why i'm choosing PHP - scalability, specifically for cheap hosting. that simply cannot be done with .net applications, but with php it's no problem: just upload the scripts, have them interface with the base, and do the bidding.

the aim is for the system to be extremely modular and portable, so portions of it can be distributed to a plethora of hosts and they will all work in unison to complete work.

the PHP... it's winning me over

nutballs

Perk hit the problems so I won't cover those. the pipe (the network) is the biggest.

How I multithread my php stuff is to use perk's webrequest class to hit a processing page that replies back with an expected string, at which point the webrequest class terminates early. it could all be done without his class, it just made it easy for me to deal with (inventing wheels and all). As a result I have a loop in my requester that hits the processor 10 times per run, triggering 10 separate instances of the process. I don't have to wait around for anything to finish, and if you think a little backwards, you can have the processor hit an update page when each instance is done, however long it takes, so you don't have to wait around, and yet still know, or trigger a next step in your process.

It's just like having 10 people surfing your site. It doesn't wait for the first guy to be done before it replies to the next 9 guys.
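perk's webrequest class isn't shown in this thread, but a bare-sockets stand-in for that trigger loop could look like this (example.com/process.php is a made-up target; substitute your own):

<?php
// Trigger 10 parallel instances of a processing page without waiting.
for ($i = 0; $i < 10; $i++) {
    $fp = fsockopen('example.com', 80, $errno, $errstr, 5);
    if (!$fp) continue;
    fwrite($fp, "GET /process.php?job=$i HTTP/1.0\r\n"
              . "Host: example.com\r\n"
              . "Connection: Close\r\n\r\n");
    fclose($fp); // hang up right away; we never read the reply
}

One caveat: the processing page would want ignore_user_abort(true) near the top, so the early hang-up doesn't kill it mid-run.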

perkiset

Quote from Indica:
> mostly the latter, for scraping and other similar activities. i think i may still use .net (via mono) for processor-heavy tasks like page generation, keyword generation, stuff like that.

Oh man... if page generation is taking too long then you are doing it wrong. I don't have a site that isn't 100% dynamic, and they just flat out kick ass. In fact, my PHP sites are, in some cases, over 10x the speed of my compiled apps because of the load that the whole framework puts on them. With my PHP apps, they load and execute precisely, and only, what they absolutely must. But all that said, if you are still in a bind, then make page generation a distributed task like spidering and presto!

Quote from Indica:
> this is how i envisioned doing it (after seeing you describe this setup here). are you able to watch ps aux (whatever that is, i've got to brush up on *nix, i feel like a fish out of water in it) from shared hosts and such? if not, i think i can code my scripts in such a way that they run until there is no more work to be done. so using your spider example, each script would continue to run until the queue of urls is empty, thus eliminating the need for a watcher. this way i'd launch n amount of scripts and they'd run until the work's done. what do you think about this versus your method? what pros and cons of such a method can you think of, in comparison to yours?

@ ps aux - this is a *nix command that tells you each of the processes that is currently running on your box. From php, you'd do $buff = shell_exec('ps aux'); and $buff will contain the text that *nix popped out, then you'd regex it to extract what you want. So if you spawned 10 spiderlets, you'd have 10 lines describing the spiderlet processes running (there's a quick sketch of this after these notes).
@ fire & forget - This works fine on a dedicated box, but you may have problems on shared hosting. The reason, is that your hosts may well (probably are) be watching your processes and may bag you for taking too much time. Short little bursty processes will not get their eire up, nor will you pop up on their radar. Also, if you have a multi function box (web sites, eblaster etc etc) then you can easily throttle various processes for your current priorities. This is a biggie for me, as I often go from bored to maxxed out in a very short amount of time... so I can tell my spiders that they are only allowed 3 processes, my blaster can do no more than 10/minute and I've got almost my entire box at my disposal (the processor anyway).

In general (opinion here) I do not like fire and forget anymore. I wrote a lot more of that sort of thing in the windoz environment... when I fully embraced a *nix way of thinking, it changed my programmatic style dramatically. I now find that I am drawn to coding tiny little processes that can start and stop on a dime, because the start and stop overhead does not exist like in a windows environment.

Another issue: if you write in a very webbish PHP way, then you are not thinking about garbage collection and freeing your objects. So writing daemons with this attitude, you may be more likely to write leaky apps - or you might make use of a lib that is leaky and you wouldn't even know it, because it was never designed to run for 8 straight hours. If you spin off processes (don't even think of them as apps anymore) and they do a little something and quit, your memory management is automatic and absolute.

Quote from Indica:
> this is *the* reason why i'm choosing PHP - scalability, specifically for cheap hosting. that simply cannot be done with .net applications, but with php it's no problem: just upload the scripts, have them interface with the base, and do the bidding.
>
> the aim is for the system to be extremely modular and portable, so portions of it can be distributed to a plethora of hosts and they will all work in unison to complete work.

exactly ... and there's another reason for me: I got tired of rebooting windows. Little teeny processes running on a *nix box are so stable and boringly predictable it's not even funny. Coding like this will make you the creator of appliances, rather than applications. Write apps for the iPhone... write appliance processes for those things that you want to fire and forget

perkiset

Quote from nutballs:
> It's just like having 10 people surfing your site. It doesn't wait for the first guy to be done before it replies to the next 9 guys.


Spot on NBs...

Apache-based multiprocessing is just shithot. The biggest benefit here is that Apache is, in fact, a multithreaded app so you can claim that you're threading, but more importantly, you are using a single instance of PHP to do it, so you can (A) have WAY more processes running concurrently with almost no overhead and (B) make use of APC, which is where the real speed comes in. I use APC to huge effect, but obviously, only on my own boxes and not shared hosting deployments.
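Beyond the opcode caching APC does automatically, its user cache (apc_store/apc_fetch) is handy for exactly this kind of work; a tiny sketch, with a made-up key and TTL:

<?php
// Sketch of APC's user cache (the opcode cache itself needs no code).
// Key name and 300-second TTL are arbitrary examples.
$key = 'scrape:example.com';
$page = apc_fetch($key);
if ($page === false) {            // cache miss
    $page = file_get_contents('http://example.com/');
    apc_store($key, $page, 300);  // keep it around for 5 minutes
}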

Indica

hmm i see perk, good points about run time length, i hadn't thought of that. could you be arsed to make a simple example of how you do your spiderlet/watcher setup? or is it around here already? i must confess i haven't read everything here, i've got like 50 pages of unread shit

about page generation: my sites are generally dynamic also, that was more of an example than anything.

i think i can get used to firing off scripts in an on-demand fashion, though this entire thing is a complete shift in coding philosophy and style - back onto the training wheels

surprisingly i never have to reboot my windows boxes much. i just built a new xp pro server and it's been running for weeks while scraping away. i told myself i wasn't going to put *nix on it, but i suppose now i will have to. thank jesus for vmware, i should be able to vm xp pro and still use it as normal. excuse me while i go pirate. i've got ubuntu on my laptop, i suppose it's time i start using it full time so i can force myself to learn the way of the *nix. lord help me with that one, i have a hard enough time installing shit. and i've never been able to compile shit without 500 errors and *nix telling me to run back to windows

unrelated sidenote: i've been toying with extjs's web desktop @ http://extjs.com/deploy/dev/examples/desktop/desktop.html it will make one sexy interface for the system. seems to have a strong set of built in libraries which will make presentation quick and easy.

perkiset

Quote from Indica:
> hmm i see perk, good points about run time length, i hadn't thought of that. could you be arsed to make a simple example of how you do your spiderlet/watcher setup? or is it around here already? i must confess i haven't read everything here, i've got like 50 pages of unread shit

Check in the PHP Code repository for a scaled-back version of my spider... I'll have a look in a bit if you can't find it.

Quote from Indica:
> i think i can get used to firing off scripts in an on-demand fashion, though this entire thing is a complete shift in coding philosophy and style - back onto the training wheels

It is, you're right on the dot. But it is also hugely worth it. I made that journey quite a while ago and have never looked back.

