vsloathe
I've decided to give forking a go. I know it will take some work, but after this is over I hope to be able to fork all night long. I need a method though as far as the actual execution goes, so here is the basic gist of what I want to do: Fork the process, grab a web page, give the captcha from the web page back out to the browser, and wait for the user to enter the answer. Once the user has entered the answer, I will continue with the child, submit the POST and then die. I was talking with TD earlier and he mentioned using semaphores or mutexes to accomplish this. I understand the concept, but I need someone to explain the details or perhaps provide examples of this to my dumb ass. Anyone? Thanks.
jammaster82
I would, but i dont have the 'tiene'? I am interested in learning this as well. I know very little about it at all, but I'm guessing: have a page generate a unique id specific to this session and user and store it in a table like this:

insert into semaphoretable (uniqueidfield, status) values ('000001','waiting');

then transfer to a page that just does a 1-second refresh:

select status from semaphoretable where uniqueidfield='000001'

If it's still waiting, refresh to the same page; if it has finally been updated to captcha_ok by the captcha_check page and its subsidiary pages, then you can continue... just dont forget to clean up the semaphore table later. The forked-off process is the captcha thread, and if successful it does:

update semaphoretable set status='captcha_ok' where uniqueidfield='000001';

probably the wrong way.. interested in what follows jm
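The scheme above can be sketched in a few lines of PHP. This is a minimal illustration, not production code: it uses an in-memory SQLite database through PDO so it runs standalone, and the table/column names are taken straight from the post.

```php
<?php
// Sketch of the semaphore-table idea, using SQLite via PDO for illustration.
// Table and column names follow the post; a real app would use MySQL etc.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE semaphoretable (uniqueidfield TEXT PRIMARY KEY, status TEXT)");

// The launching page registers the session as waiting...
$id = '000001';
$db->prepare("INSERT INTO semaphoretable (uniqueidfield, status) VALUES (?, ?)")
   ->execute(array($id, 'waiting'));

// ...the captcha_check page flips the row when the user solves the captcha...
$db->prepare("UPDATE semaphoretable SET status = ? WHERE uniqueidfield = ?")
   ->execute(array('captcha_ok', $id));

// ...and the 1-second-refresh page polls until it sees captcha_ok.
$poll = $db->prepare("SELECT status FROM semaphoretable WHERE uniqueidfield = ?");
$poll->execute(array($id));
$status = $poll->fetchColumn();
if ($status === 'captcha_ok') {
    // continue with the POST, then clean up the semaphore row
    $db->prepare("DELETE FROM semaphoretable WHERE uniqueidfield = ?")->execute(array($id));
}
```

In real use the INSERT, UPDATE, and SELECT would live in three separate pages sharing the same database; here they run in sequence just to show the state machine.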
perkiset
A quick wrapper: when you compile PHP with the SYSVSEM library (a switch at compile time) you have access to the kernel's semaphore and mutex handles.

In case you are light here: a mutex is a system-wide handle that is either owned or not at any given time. You request ownership and are granted it if no other process has it. If it is already owned, then you can hang and wait (you'll be placed in a queue for ownership) and when your function returns the handle you proceed. In this way, you know that only one process in the system at any one time is doing <something critical>. Mutexes are named, so there can be many mutexes in the system that each represent a single-access critical component or piece of code. Code using a mutex is most often called a critical section: you enter the section by requesting the mutex with a timeout of <n ms or n secs>, and if you don't get it you can either loop and wait again or fail out and handle things another way.

A semaphore is similar - it is another named, system-wide resource requested via a function, like a mutex. A semaphore can have up to <n> owners at any given time, defined by the original creator of the semaphore. So for example, if you want up to 5 processes to have access to something BUT NO MORE, then you'd use a semaphore rather than a mutex.

Sems and mutexes do not, by themselves, really provide communication between processes - they provide process "manners." In the old world (mainframe days), you'd acquire a mutex, modify a file putting some kind of string message in it, then let go of the mutex. That way, you knew that no two processes would dick with the message file at the same time. I used sems and mutexes a lot in my DOS development days and then in about '91/92 with Borland C++ 3.1 - it was the cleanest way to organize some problematic programming things. I barely ever use them now, preferring a queueing methodology that is supported nicely by databases.
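For the PHP side of what perkiset describes, the SYSVSEM extension exposes exactly this: a semaphore with max_acquire of 1 behaves like a mutex. A minimal sketch of a critical section, guarded so it degrades gracefully on builds without the extension:

```php
<?php
// Critical section guarded by a SysV semaphore used as a mutex.
// Requires PHP compiled with --enable-sysvsem (sem_get & co.).
$owned = null; // null => sysvsem not available on this build
if (function_exists('sem_get')) {
    $key = ftok(__FILE__, 'c');    // derive a system-wide key from this file
    $mutex = sem_get($key, 1);     // max_acquire = 1 => behaves like a mutex
    $owned = sem_acquire($mutex);  // blocks until we own it
    if ($owned) {
        // ... only one process in the system runs this at a time ...
        sem_release($mutex);
    }
    sem_remove($mutex);            // clean up the kernel object when done
}
```

For a true semaphore (say, 5 concurrent owners) the second argument to sem_get would be 5 instead of 1; everything else stays the same.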
I also prefer using the OS kernel rather than forking for apps like you are describing, because the speed is less critical - you're going to wait LOADS of time for the user (relative to processor time), so why implement the complexity of threads when you'll have that kind of wait? IMO, better to use poor man's threading (shell_exec another process) with a database queue as your messaging mechanism on that kind of app. However, it is an excellent skill to have and interestingly, I may have the need to do something similar in the near future - I am currently analysing the strength of writing a TCP/IP-based service entirely in PHP, which will require threading. Here is a complete example from the PHP code repository - I think it should spur you on a bit. Good luck, please post your results! http://www.perkiset.org/forum/php/forked_php_daemon_example-t474.0.html
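The "poor man's threading" idea can be sketched briefly. This is only an illustration of the shape of it: the queue is a flat file here rather than a database table, and worker.php is a hypothetical stand-in for whatever script consumes the jobs.

```php
<?php
// "Poor man's threading": launch a detached worker with shell_exec and talk
// to it through a queue (a flat file here; a database table in practice).
function enqueueJob($queueFile, array $job)
{
    // one JSON-encoded job per line; the worker consumes lines in order
    file_put_contents($queueFile, json_encode($job) . "\n", FILE_APPEND | LOCK_EX);
}

function launchWorker($script)
{
    // the trailing & detaches the child, so this call returns immediately
    shell_exec('php ' . escapeshellarg($script) . ' > /dev/null 2>&1 &');
}

$queueFile = tempnam(sys_get_temp_dir(), 'capq');
enqueueJob($queueFile, array('id' => '000001', 'action' => 'fetch_captcha'));
// launchWorker('worker.php');  // worker.php is not defined in this sketch
```

Because the queue is the only shared state, there is no memory sharing to manage and no IPC beyond reads and writes, which is the whole appeal of the approach.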
vsloathe
Thanks as always, Perk. I might give it a shot with curl_multi so I don't have to get my hands as dirty, but the only way I've used curl_multi to date is to spawn a bunch of curl children and then just wait until they are all done. I want to take the returns asynchronously, so that I can display captchas as they come in. I.e. from the user's perspective I want this to *blaze*. We shall see what I come up with.
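Taking curl_multi returns as they complete, rather than waiting for the whole batch, comes down to draining curl_multi_info_read inside the exec loop. A minimal sketch; file:// URLs pointing at a temp file stand in for the real captcha pages so it runs without a network:

```php
<?php
// Consume curl_multi results as each transfer completes, instead of
// waiting until all handles are done.
$tmp = tempnam(sys_get_temp_dir(), 'cap');
file_put_contents($tmp, 'fake captcha payload');
$urls = array('file://' . $tmp, 'file://' . $tmp);

$mh = curl_multi_init();
foreach ($urls as $u) {
    $ch = curl_init($u);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}

$done = array();
do {
    curl_multi_exec($mh, $running);
    // drain completions immediately - this is the point where each captcha
    // could be pushed out to the browser, ahead of the rest of the batch
    while ($info = curl_multi_info_read($mh)) {
        $done[] = curl_multi_getcontent($info['handle']);
        curl_multi_remove_handle($mh, $info['handle']);
    }
    if ($running) {
        curl_multi_select($mh, 0.1); // wait for activity; avoids a busy loop
    }
} while ($running > 0);
curl_multi_close($mh);
```

The key difference from the "spawn and wait" pattern is that the inner while loop fires per-transfer, so the first captcha is available as soon as its own request finishes.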
perkiset
"Blaze" is relative - remember, the WAY WAY hugest bottleneck will be the user waiting for the captcha, and bandwidth in and out of your machine. You'd be better off focusing on how to shrink those times than on the trivially small bump (by comparison) you'll get by forking as opposed to execing (if you are interested in that technique). Again, forking is an outstanding tool to have in the box, but it can also be an order of magnitude more complicated when it comes to memory management, understanding "who's doing what" and particularly debugging, which in a multithreaded app is sometimes *monstrously* difficult - believe me on that one. Not to throw water on the notion man, just a quick analysis of bang-for-buck on doing this. If it's about learning the technique then I am all for it. If it's a needed-soon mission-critical app then I'd use another technology until I understood the dynamics of it better.
vsloathe
Yeah. As I said, I'll probably curl_multi it. There will be an initial delay as the threads are created and launched, but once the captchas start coming back to the user, he'll see nothing but rapid fire awesomeness, which is the goal.
nop_90
Forks are not threads. A fork is a separate process and therefore does not share memory etc. You need some sort of IPC if you want to do that, which is a big pain in the ass.
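nop_90's point can be shown concretely: a forked child gets a copy of the parent's memory, so a result has to travel back over an explicit IPC channel. A sketch using pcntl_fork with a socket pair, guarded because pcntl and stream_socket_pair are CLI/Unix-only:

```php
<?php
// A forked child cannot just set a variable the parent sees; the result
// must come back over IPC - here, one end of a Unix socket pair.
$msg = null; // stays null where pcntl/sockets are unavailable
if (function_exists('pcntl_fork') && function_exists('stream_socket_pair')) {
    list($parentSock, $childSock) = stream_socket_pair(
        STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
    $pid = pcntl_fork();
    if ($pid === 0) {                      // child process
        fclose($parentSock);
        fwrite($childSock, "captcha-url-from-child");
        fclose($childSock);
        exit(0);                           // child dies after reporting back
    }
    fclose($childSock);                    // parent process
    $msg = fread($parentSock, 1024);       // blocks until the child writes
    fclose($parentSock);
    pcntl_waitpid($pid, $status);          // reap the child, avoid a zombie
}
```

Everything past the fwrite/fread pair (framing, multiple children, partial reads) is where the "big pain in ass" lives; this sketch shows only the happy path.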
DangerMouse
I wonder if an AJAX/COMET solution may be appropriate for what you're trying to achieve, vsloathe? It could be used to submit completed captchas, but also as a trigger to grab fresh ones. A simple approach could be 10 on a page; on tab out of the data entry point for number 5, request a new batch. This could possibly help with captcha timeout issues as well. A COMET approach could mean the server is constantly requesting images, passing them to the browser as soon as they are ready. Just a thought. DM
vsloathe
quote author=DangerMouse link=topic=714.msg4967#msg4967 date=1200649062
I wonder if an AJAX/COMET solution may be appropriate for what you're trying to achieve vsloathe? [...]
A good thought. It had occurred to me as well, but my AJAX kung fu is weak. Any pseudocode or examples to help me get started?
DangerMouse
quote author=vsloathe link=topic=714.msg4970#msg4970 date=1200667675
A good thought. It had occurred to me as well, but my AJAX kung fu is weak. Any pseudocode or examples to help me get started?
Afraid not, I've never even attempted AJAX, let alone its inverse COMET - sure Perk and others will have plenty of better tips than me there. I kind of envisaged using javascript's event model to catch an "on tab out" or "on submit/click" style event to launch the xmlhttpRequest - I think there are equivalent mechanisms but I've no clue how they work. Perk makes a good case for using another technique over xmlhttp here: http://www.perkiset.org/forum/ajax/it%E2%80%99s_time_to_dump_xmlhttprequest-t336.0.html, although I suspect cross-platform/domain issues might not be that important to you here (although it could be an awesome extension of the project :mob .

Anyways, the xmlhttpRequest would pass the solved captcha value the user entered to a PHP handling script; the response from this script, in whatever format, could then be parsed by the javascript and the browser display changed accordingly - appending any fresh available captchas to the bottom of the list in this case (probably a DOM insert child/sibling type command - again I'm not too sure). The power would be in the PHP handling script, which could use some kind of persistence to track exactly what's going on, launch requests to grab new captcha images, and pass along captcha responses sent up from the browser. Thinking about it, you could do some quite interesting things here - dynamically calculating how many unsolved captchas need to be kept in a buffer depending on solving rate, for example.

I know even less about how a COMET approach would work; I've yet to see many implementations of this technology outside of full-blown web applications (google spreadsheets etc.) - but I think the gist is that the server pushes to the client, causing it to update the page when specified conditions are met - in this instance, when new captchas have been obtained. This could be useful to reduce the complexity of the PHP handler for the AJAX call described above: it would simply need to receive completed captchas and forward them on to their required destination. Any fresh captchas would be posted to the browser using a separate process. Whilst I think this is a neater solution, and definitely cooler on the geek/techy scale, it may introduce more problems than it's worth, as I suspect there would need to be a way to choke the obtaining of new captchas where solving is slow, so some form of cross-script communication is still needed. Plus, iirc, the server implementation is quite tricky due to system resource hogging.

It would also be cool to use XSLT to render the page - the data could then all be sent and received in XML with ease - again, I don't really know how to do this, just the theory :  I'm sure others will weigh in here with more practical suggestions lol, but thats my rambling  worth  DM
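The PHP handling script DangerMouse describes has a simple shape: take a solved captcha in, forward it, and answer with a topped-up buffer of fresh ones. A hedged sketch; fetchFreshCaptchas() and forwardSolved() are hypothetical stand-ins for the real fetch/submit work, and the buffer size of 3 is arbitrary.

```php
<?php
// Minimal shape of an AJAX handler for the captcha pipeline described above.
function handleAjax(array $post, array &$buffer)
{
    if (isset($post['solved_id'], $post['answer'])) {
        forwardSolved($post['solved_id'], $post['answer']);
        unset($buffer[$post['solved_id']]);   // that captcha is finished
    }
    // keep the unsolved buffer topped up; solving-rate logic would go here
    while (count($buffer) < 3) {
        $fresh = fetchFreshCaptchas(1);
        $buffer[$fresh[0]['id']] = $fresh[0]['url'];
    }
    return json_encode($buffer);   // the JS side appends these to the page
}

// --- stubs so the sketch is self-contained ---
function forwardSolved($id, $answer) { /* POST the answer to the target site */ }
function fetchFreshCaptchas($n)
{
    static $seq = 0;
    $out = array();
    for ($i = 0; $i < $n; $i++) {
        $seq++;
        $out[] = array('id' => "cap$seq", 'url' => "/img/cap$seq.png");
    }
    return $out;
}
```

The persistence DangerMouse mentions is the $buffer array; in a real handler it would live in the session or a database between requests.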
jammaster82
Isnt AJAX/COMET/XMLHttpRequest all gonna use the same amount of bandwidth to pretty much do the same thing? Only ajax/comet/xmlhttprequest has soooo many new technicalities to it, but at the bottom of all these we are all stuck with the internet, and on a 33.6 dialup a simple 400-byte http GET request takes about a second including the tcp/ip handling - and it still takes about a second even if you're t1 to t1, for all the handshaking and blah zay blah that goes on between the hops along the route, like when you do a tracert from ms-dos... right? The time you MIGHT save is a few milliseconds on the client side with AJAX, and the page wont have to be refreshed, but it's still a tcp/ip transmission, and getting that request out sooner wont matter time-wise when you have to wait that second anyway... right?

I would just lean with whats simple - meta refresh has been working for soooo many years now. I would hate to have my site crash at 3 am and have to wake up with one eye open and analyze the shit for four hours till the client woke up, only to tell him i found the error in some new untested technology i anakin skywalker episode two'ed my way right into and lost my freaking arm over... (nailed padme though, sweet!)  but that could also be cause i am lazy to learn new things until someone else breaks them and fixes that bleeding edge...
DangerMouse
Yeah, I don't think it will help in terms of how much bandwidth is used or how long the data will take to arrive at the browser, but I got the impression that the idea was to create a fast interface that bombards a user with captchas to enter consistently? I just mentioned the approach above as it avoids browser refreshes and allows a background process to obtain captchas while others are being entered. It was totally theoretical though, just something I dreamed up - don't really have any strong evidence to prove it would work. DM
jammaster82
Hmm... How about loading like 5 (or 'x') of them with the first page and handling it all client side, so that by the time they get to 3 you can ask for four more, seeing as how four would take just as long as one - economizing the motion that takes the longest (the tcp/ip transmission, however it's done)? Just interested in this discussion, maintaining my position of not knowing anything..
vsloathe
I decided to try out the xajax framework, because I'd been kicking around the idea for a while and, as I mentioned earlier, my JS/AJAX skills are weak. Here's what I've got, with a possible request from my fellow geeks: I'm requesting a new captcha on each onkeyup event in the answer entry box, then storing the result in the HTML of the page. I like using the page's HTML as storage space, but I'm wondering if there's a better way to do it - currently it's not the best, because the speed increase is only marginal, as it will go out on the last onkeyup and fetch a captcha. If the user hits enter immediately after the last onkeyup, he's not going to see much difference speed-wise. I was wondering if any of you geniuses could think up a way to make the onkeyup handler *only* fire for the first onkeyup event?

<?php
/**
 * @author vsloathe
 * @copyright 2008
 */
require_once("xajax_core/xajaxAIO.inc.php");
require_once('class.gmailCreator.php');

$xajax = new xajax();
$xajax->registerFunction("prepCap");
$xajax->registerFunction("getCap");
$xajax->registerFunction("doPost");
$xajax->processRequest();
$xajax->printJavascript();

echo('
<html>
<body onload="xajax_getCap();">
<form onsubmit="xajax_doPost(escape(postStr.value), capAnswer.value, escape(qPostStr.value), escape(capUrl.value)); return false;">
<div id="capImg"></div>
<input type="text" name="capAnswer" onkeyup="xajax_prepCap()" /><br />
<input type="submit" value="Go" onclick="xajax_doPost(escape(postStr.value), capAnswer.value, escape(qPostStr.value), escape(capUrl.value));" />
<input type="hidden" name="postStr" />
<input type="hidden" name="capUrl" value="0" />
<input type="hidden" name="qPostStr" value="0" />
</form>
</body>
</html>
');

function prepCap()
{
    $objResponse = new xajaxResponse();
    $GC = new gmailCreator;
    $GC->numThreads = 1;
    $GC->getAccountPage();
    $capurls = $GC->buildPostStr();
    $postStr = $GC->postStrings[0];
    $objResponse->assign("capUrl", "value", $capurls[0]);
    $objResponse->assign("qPostStr", "value", $postStr);
    return $objResponse;
}

function getCap()
{
    $objResponse = new xajaxResponse();
    $GC = new gmailCreator;
    $GC->numThreads = 1;
    $GC->getAccountPage();
    $capurls = $GC->buildPostStr();
    $postStr = $GC->postStrings[0];
    $objResponse->assign("capImg", "innerHTML", '<img src="'.$capurls[0].'" />');
    $objResponse->assign("postStr", "value", $postStr);
    return $objResponse;
}

function doPost($postStr, $capAnswer, $qPostStr, $capUrl)
{
    $objResponse = new xajaxResponse();
    if ($qPostStr) {
        $objResponse->assign("capImg", "innerHTML", '<img src="'.urldecode($capUrl).'" />');
        $objResponse->assign("capAnswer", "value", '');
        $objResponse->assign("postStr", "value", $qPostStr); // was $qpostStr - case typo
    } else {
        $GC = new gmailCreator;
        $GC->numThreads = 1;
        $GC->getAccountPage();
        $capurls = $GC->buildPostStr();
        $postStr = $GC->postStrings[0];
        $objResponse->assign("capImg", "innerHTML", '<img src="'.$capurls[0].'" />');
        $objResponse->assign("capAnswer", "value", '');
        $objResponse->assign("postStr", "value", $postStr);
    }
    $postStr .= '&newaccountcaptcha='.$capAnswer;
    $ph = popen('php dopost.php "'.$postStr.'"', 'r');
    return $objResponse;
}
?>
Nothing too special in there as far as my intellectual property goes, so play with it all you want, but of course the real meat is in that class file that I require_once at the beginning.
thedarkness
quote author=vsloathe link=topic=714.msg4977#msg4977 date=1200679471 I like using the page's HTML as storage space, but I'm wondering if there's a better way to do it - currently it's not the best because the speed increase is only marginal, as it will go out on the last onkeyup and fetch a captcha.
Use another async AJAX call to store the captcha image somewhere else (database, flat file) then grab it back when you need it? Cheers, td
perkiset
quote author=jammaster82 link=topic=714.msg4972#msg4972 date=1200673472
Isnt AJAX/COMET/XMLHttpRequest all gonna use the same amount of bandwidth to pretty much do the same thing
Absolutely not. The packet you throw with an XMLHTTPRequest or an XRPC is tiny. It is a fraction of the overhead of a full page pull, unless your pages are simply "hello world." A pull for a whole page is quite "heavy" by comparison. VS - both my Ajax Requestor class and XRPC are in the Javascript repository - they are a lot lighter and should be really easy to understand. Writing to the server side is also almost trivial. With what you are doing, I think an Ajax MO is a good plan and will work quite nicely.
nop_90
Too lazy to read the entire thread so i just browsed. You do not set the image using ajax etc. (or not directly). On the server side: avoid threads, no need for them. Use multicurl - have multi fetch the captchas and store them inside a queue, and attach an id token to each for the client side. Have the submit button (or whatever event u want) attached to a js event. Upon pushing the button, send the inputted captcha to the server, with the captcha token id. When ajax replies, it will send the new token id of the new captcha. So if id=1233121231 then set the image value to image?value=1233121231 - that way it will automatically grab the image from the server and refresh. (Or u can do other schemes; the only thing is that the image url has to be unique.)
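nop_90's token scheme can be sketched in a few functions. An assumption-laden illustration: the queue is a plain array here (a real app would persist it in the session or a database), and the function names are made up for the sketch.

```php
<?php
// Sketch of the token scheme: every queued captcha is keyed by a unique id,
// the <img> src embeds that id, and each reply hands back the next id.
$queue = array();                          // token => captcha image data

function enqueueCaptcha(&$queue, $imageData)
{
    $token = uniqid('', true);             // unique id => unique image URL
    $queue[$token] = $imageData;
    return $token;
}

// image.php?value=<token> would stream this back to the browser
function serveImage($queue, $token)
{
    return isset($queue[$token]) ? $queue[$token] : null;
}

// The submit handler: record the answer, reply with the next token so the
// client sets <img src="image?value=NEXT"> and the browser auto-refreshes it.
function submitAnswer(&$queue, $token, $answer, $nextImageData)
{
    // forward $answer to the target site here
    unset($queue[$token]);                 // this captcha is done
    return enqueueCaptcha($queue, $nextImageData);
}
```

Because every token is unique, the browser never serves a stale cached image, which is the point of keying the URL on the id.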
nop_90
As a trivial exercise for the reader: you should be able to use the same server code (as in, you do not change a line of code), but if you use proper rpc you can make the client in whatever language/platform you want  so u could make the client, let's say, in wxPython so it looks very nice, but most of the grunt work is done on the server  If you use the right rpc, with the correct libraries, the entire client (with gui) should be doable in 50-100 lines of code.