emonk

This got ignored over at the syndk8, so I thought I'd ask for some help here. Seems like a smarter crowd in general. Advice, please?

I need to be able to pipeline my HTTP GET requests. I need to be able to do this through an HTTP Proxy server. Making a single request at a time is SLOW. Pipelining speeds up my app like a thousand percent.

I can't figure out how. I've tried using HTTP CONNECT then making the HTTP/1.1 requests. I've tried smuggling them in via an HTTP/1.0 POST. Nothing works.

Does anyone know of a way to do this?

Here is what I have so far, but it doesn't work.

function fastget($domain, $pages) {
    $proxy = "10.1.1.154";
    $port = 81;
    $out = "";
    $output = "";

    $fp = fsockopen($proxy, $port, $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)\n";
    } else {
        // Set up our header stuff...
        $header = "Host: $domain\r\n"
                . "User-Agent: " . random_useragent() . "\r\n"
                . "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n"
                . "Accept-Language: en-us,en;q=0.5\r\n"
                . "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n"
                . "Keep-Alive: 300\r\n"
                . "Proxy-Connection: keep-alive\r\n";

        // Queue every request back to back - this is the pipelining part...
        $last = count($pages) - 1;
        foreach (array_values($pages) as $i => $page) {
            $out .= "GET $page HTTP/1.1\r\n$header";
            // Shut down our http connection after the last request...
            $out .= ($i == $last) ? "Connection: close\r\n\r\n" : "\r\n";
        }

        // The part that doesn't work: try to tunnel through the proxy first...
        $postheader = "CONNECT $domain:80 HTTP/1.1\r\n\r\n";
        $out = $postheader . $out;

        print $out;

        fwrite($fp, $out);
        while (!feof($fp)) {
            $output .= fgets($fp, 128);
        }
        fclose($fp);
    }
    return $output;
}
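
For reference, a forward proxy will normally accept requests without any CONNECT at all if you put the absolute URI on the request line. Below is a minimal sketch of that variant, reusing the hypothetical proxy address and random_useragent() helper from above - whether the requests actually get pipelined through to the origin is entirely up to the proxy:

function fastget_via_proxy($domain, $pages) {
    $proxy = "10.1.1.154";  // hypothetical proxy, same as above
    $port = 81;
    $fp = fsockopen($proxy, $port, $errno, $errstr, 30);
    if (!$fp) return "";
    $out = "";
    $last = count($pages) - 1;
    foreach (array_values($pages) as $i => $page) {
        // Absolute URI on the request line is how HTTP/1.1 talks to a forward proxy...
        $out .= "GET http://$domain$page HTTP/1.1\r\n"
              . "Host: $domain\r\n"
              . "User-Agent: " . random_useragent() . "\r\n"
              . "Proxy-Connection: keep-alive\r\n";
        // Ask the proxy to drop the connection after the final request...
        $out .= ($i == $last) ? "Connection: close\r\n\r\n" : "\r\n";
    }
    fwrite($fp, $out);
    $output = "";
    while (!feof($fp)) {
        $output .= fgets($fp, 128);
    }
    fclose($fp);
    return $output;
}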

perkiset

Have you ever seen pipelining in this way work? I tried my butt off for a while and got nowhere, because the HTTP spec is a strict request/response mechanism (I tried all sorts of clever hacks and such).

Apache, for example, does not maintain a notion of an open request line so that you can continually re-request over the same pipe. Consider, for example, just the way that the headers portion of an HTTP request works: once you send them and start sending content, you cannot send headers anymore. And there is no notion of a "page tail" so that a new set of headers can be sent down the same pipe. This can be tested simply by telnetting - if you open a connection and attempt to get more than one page it will not work - unless there is magic here that I am unaware of (a huge possibility, BTW - I am no expert here).

My personal thing was to pipeline AJAX responses - I wanted to build a data flooder and I just couldn't get it to work. My eventual way was to use concurrent requests rather than trying to speed up serialized requests by leaving the pipe open.

Sorry, this was certainly not the answer you were looking for - but perhaps if you want to talk about why you need it to go so much faster we could work through some options. The truth is that the connection/handshake/header/content/close cycle is not all THAT burdened by the connection/handshake/close components, so if you need to go that much faster perhaps we should look at another solution.
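
(A minimal sketch of that concurrent-requests alternative using PHP's curl_multi - function name and URLs are hypothetical, error handling omitted:)

function fetch_concurrent($urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }
    // Drive all transfers in parallel until every handle is finished...
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    $results = array();
    foreach ($handles as $ch) {
        $results[] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}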

/p

emonk

Quote from: perkiset
Have you ever seen pipelining in this way work? I tried my butt off for a while and got nowhere, because the HTTP spec is a strict request/response mechanism (I tried all sorts of clever hacks and such). [...]

Thanks for the well thought out response perks.

I actually got this code to work, and it sped my google SERP scraper up by something like a thousand percent. Also it kept me from getting blocked as fast. I think the big G monitors per connection instead of per request.

Pipelining only works with HTTP/1.1, not 1.0, so that was probably your problem when you tried it. Basically you just issue all your GET requests in a row before the 'Connection: close'. I just can't figure out how to make it work through an HTTP proxy for some reason, which kind of makes it pointless for SERP scraping.
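
On the wire it's nothing fancy - here's what two pipelined requests look like back to back (paths hypothetical, headers trimmed):

$out = "GET /search?q=foo HTTP/1.1\r\n"
     . "Host: www.google.com\r\n"
     . "\r\n"                                   // blank line ends request 1
     . "GET /search?q=foo&start=10 HTTP/1.1\r\n"
     . "Host: www.google.com\r\n"
     . "Connection: close\r\n"                  // last request asks for the close
     . "\r\n";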

If you want an example of it working, just change $proxy and $port to 'www.google.com' and '80', and pass it a list of request URLs. It'll scrape all ten pages of a google query as fast as it'll scrape one page with my normal curl 'connect -> request -> close' style code.
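
For instance, something like this (query hypothetical) pulls all ten SERPs in one shot:

$pages = array();
for ($start = 0; $start < 100; $start += 10) {
    $pages[] = "/search?q=foo&start=$start";   // pages 1-10 of the query
}
$raw = fastget("www.google.com", $pages);      // with $proxy/$port pointed at www.google.com:80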

I haven't experimented to see how many requests they'll let you pipeline at once, but it seems to be a lot, and it doesn't seem to trigger the captcha at all.

perkiset

Could be the 1.0 thang - I had almost all of my stuff back-versioned to make sure I was ultimately compatible - there was also some discussion that AJAX worked better with that rev rather than 1.1.

Pretty cool discovery if you're right about G and this action... and I'm gonna have to try it again from a pure HTTP perspective rather than an AJAX one... my goals were to stream out telemetry really fast rather than pages.

Side note - I'm wondering if the server itself has something to do with the pipe - I never tried hitting Google, and Apache did not like my efforts at all. I wonder if I had something set or not set to make this happen... or what. So in that respect, I'm wondering if the proxies specifically don't want you doing that... dunno. In any case, we've clearly gone beyond my knowledge LOL...
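
(If it was an Apache setting, the relevant knobs are its persistent-connection directives in httpd.conf - a hedged example with illustrative values; with KeepAlive Off, every request gets its own connection and keep-alive/pipelining attempts will fail:)

KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15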

Good luck, please post if you get more results worthy of note... this is good research.

/p

