
perkiset
I’ve taken a look at my old code and its clunky nature. I have also been considering the “pipelining” thread, as well as error handling and funkier packets returned from the server. With that, I have a new web request class, cleverly named “webRequest2.” It allows for user agent spoofing as well as different encodings and languages. It is also more version-resilient, being able to run in PHP4 and earlier while taking advantage of the timeout feature of PHP5 as well. NBs, I’ve also added a simpleGet function for you.
Additionally, the old class had a lot of crap in it for old faulty servers I was dealing with: success on timeouts, considering a page completed if I ever saw the "</body>" tag ... stuff like that. I think these things would be better placed in a derivative class than in the fundamental class, so this one is much leaner from that perspective. I have not implemented pipelining, nor looked at multipart packets for VSloathe, but the way it is built these should be minor mods.
<>Properties>
<>Methods>
<>Usage>
<?php
$req = new webRequest2();
$content = $req->simpleGet('http://www.perkiset.org/');

$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$content = $req->simpleGet('http://www.braindonkey.com/2007/11/10/the-road-record-cheated-out-of-my-millions/');
print_r($req->getHeaders());

$req = new webRequest2();
$req->domain = 'blogs.pcworld.com';
$req->url = '/staffblog/archives/005885.html';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
$req->dispatch();
echo $req->getRawResponse();
?>
Here is an example of debug output for the blog site that NBs used as an example in the old thread:
Dispatch Starts
Method: GET
Built GET String:
FinalURL: http://blogs.pcworld.com/staffblog/archives/005885.html
Outbound Header: GET http://blogs.pcworld.com/staffblog/archives/005885.html HTTP/1.1
Host: blogs.pcworld.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close
Content-Type: text/html
Content-Length: 0
Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 1471
GetChunk: Received 2778
GetChunk: Received 627
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1968
GetChunk: Received 0
processHeaders: Starts
processHeaders: Array Follows
Array
(
    [HTTP/1.1 200 OK] =>
    [Date] => Mon, 12 Nov 2007 23:31:21 GMT
    [Server] => Apache/1.3.27 (Unix)
    [Connection] => close
    [Content-Type] => text/html
    [Vary] => Accept-Encoding
)
Execute: Successful Retrieve
postProcess: Content length is 40891
Default onSuccess()
Default afterExecute()
Execute: Completes
I have not yet had time to work through a solid testing suite for POST data. Also, I was using your server, NBs, for chunked testing and it spontaneously started sending the entire packet in one chunk rather than forcing me into many, so that symptom may rear its ugly head yet again. This class should be much better at handling packets like that than my previous class, though. Feedback and bugs welcome, thanks! /p
<?php // CODE HAS BEEN MOVED TO NEXT POST WITH UPDATES. ?>
perkiset
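Since chunked responses come up here, a minimal sketch of how a chunked body is laid out on the wire may help: each chunk is a hex size line, CRLF, that many bytes of data, CRLF, and a zero-size chunk ends the body. The helper name `decodeChunkedBody()` is made up for this illustration and is not part of webRequest2; it assumes the whole body has already been read into one string.

```php
<?php
// Hypothetical helper (not part of webRequest2): decode a complete
// "Transfer-Encoding: chunked" body that is already in memory.
function decodeChunkedBody($raw)
{
    $out = '';
    $pos = 0;
    $len = strlen($raw);
    while ($pos < $len) {
        // Each chunk starts with its size in hex, terminated by CRLF.
        $lineEnd = strpos($raw, "\r\n", $pos);
        if ($lineEnd === false) {
            break; // truncated input: return what was decoded so far
        }
        // Ignore optional chunk extensions after a semicolon.
        $parts = explode(';', substr($raw, $pos, $lineEnd - $pos), 2);
        $sizeHex = trim($parts[0]);
        if ($sizeHex === '' || !ctype_xdigit($sizeHex)) {
            break; // not a valid size line
        }
        $size = hexdec($sizeHex);
        if ($size == 0) {
            break; // the zero-size chunk marks the end of the body
        }
        $out .= substr($raw, $lineEnd + 2, $size);
        // Skip the size line, the chunk data and the CRLF that follows it.
        $pos = $lineEnd + 2 + $size + 2;
    }
    return $out;
}

$wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
echo decodeChunkedBody($wire), "\n"; // prints "Wikipedia"
```

A server is free to send one big chunk or many small ones, which is why the same page can look different from one request to the next.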
First Update: NBs made a great request to add some events to the execute() - so there are 4 functions that you can override in a descendant class that might make sense for you:
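As a rough illustration of overriding those four hooks (beforeExecute, onSuccess, onFailure and afterExecute, as they are named in the class code later in the thread), here is a sketch of a descendant that times requests and counts outcomes. So that it runs by itself, it extends a tiny stand-in class that only mimics the order in which the hooks fire; `webRequestStandIn` and `timedRequest` are invented names, and with the real class you would extend webRequest2 instead.

```php
<?php
// Stand-in for webRequest2 so this sketch runs without the real class.
// It only mimics the order in which the four hooks are called.
class webRequestStandIn
{
    public $resultCode = 0;

    protected function beforeExecute() {}
    protected function afterExecute() {}
    protected function onSuccess() {}
    protected function onFailure() {}

    // $ok simulates whether the (imaginary) request succeeded.
    public function dispatch($ok = true)
    {
        $this->beforeExecute();
        if ($ok) {
            $this->resultCode = 200;
            $this->onSuccess();
        } else {
            $this->onFailure();
        }
        $this->afterExecute();
        return $ok ? $this->resultCode : false;
    }
}

// Hypothetical descendant: times each request and counts outcomes.
class timedRequest extends webRequestStandIn
{
    public $successes = 0;
    public $failures = 0;
    public $elapsed = 0.0;
    private $startedAt = 0.0;

    protected function beforeExecute()
    {
        parent::beforeExecute();
        $this->startedAt = microtime(true);
    }

    protected function onSuccess()
    {
        parent::onSuccess();
        $this->successes++;
    }

    protected function onFailure()
    {
        parent::onFailure();
        $this->failures++;
    }

    protected function afterExecute()
    {
        $this->elapsed = microtime(true) - $this->startedAt;
        parent::afterExecute(); // the real class closes its socket here
    }
}

$req = new timedRequest();
$req->dispatch(true);
$req->dispatch(false);
echo "ok={$req->successes} failed={$req->failures}\n"; // prints "ok=1 failed=1"
```

Calling the parent version from each override matters most for afterExecute(), since that is where the real class closes the socket.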
<?php // Code has been moved to lower post with updates ?>
nop_90
very nice.
regardless of what language you use, good idea to study perk's code so u can see how http protocol works. (nothing like having live code to examine, rather than stupid specs) i have a few ideas where perk's code may become useful
perkiset
Thanks nop -
Now I have a question for anyone reading along... WTF is with WordPress? If I set the permalinks structure to anything except "default" then the URLs that I send into the server fail... if I leave it at default then I get what I expect. For example, if I request "http://www.perkiset.org/politics/?p=29" I get the correct page - if I request "http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone" with the permalinks on I get a 404. Of course, they both work perfectly in a browser. Additionally, if I have permalinks on and ask for the parameterized URL in Safari, it still works correctly, but then 404s in the class. I've just requested a bunch of stuff from a boatload of sites and the class is holding up nicely - except for permalinked WordPress. Any ideas? perkiset
Thanks NBs for working the problem with me -
As it happened, my header was slightly sloppy. Virtually everything could work with it except WordPress's translation routines, which did not understand it. Not only did I fix the header, but I patched the simpleGet function to make sure that it is correct as well.
<?php // Code moved yet again to another lower post ?>
nutballs
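The post by perkiset just above does not spell out what was sloppy about the header. Comparing the two debug logs in this thread, the visible change is that the request line went from carrying the full URL (`GET http://blogs.pcworld.com/... HTTP/1.1`) to carrying only the path (`GET / HTTP/1.1`), with the domain left to the Host header. Assuming that was the fix, a minimal sketch of building such a request looks like this; `buildGetRequest()` is a made-up helper, not part of webRequest2.

```php
<?php
// Hypothetical helper (not part of webRequest2): build a bare GET request
// whose request line carries only the path, with the domain in "Host:".
function buildGetRequest($completeURL)
{
    $parts = parse_url($completeURL);
    if ($parts === false || empty($parts['host'])) {
        return false; // not an absolute URL we can work with
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    $lines = array(
        "GET $path HTTP/1.1",     // path and query only, not the full URL
        "Host: {$parts['host']}", // the domain goes here instead
        'Connection: close',
        '',                       // two empty entries produce the blank
        '',                       // line that terminates the header block
    );
    return implode("\r\n", $lines);
}

echo buildGetRequest('http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone');
```

The full-URL form is what a client sends when talking to a proxy, which may be why most servers accepted it while the permalink handling did not.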
cool cool it worked.
I successfully ripped your blog. errr though you got some weird chars in there. new thread about that...
perkiset
<>Another Update>
Completed code follows the examples.
<?php
// This is an example of using simpleGet and the result code...
$req = new webRequest2();
if (!$buff = $req->simpleGet('http://www.perkiset.org/forum/')) {
    switch ($req->resultCode) {
        case 301:
        case 302:
            $newURL = $req->redirect;
            break;
    }
}

// This example will write a debug log and echo the result code of the get.
$req = new webRequest2();
$req->domain = 'www.perkiset.org';
$req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
echo $req->dispatch();
?>
Here is the debug log on a page that Google wants to redirect:
<?php
$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$req->simpleGet('http://google.com/');
?>
simpleGet: Starts with [http://google.com/]
Outbound Header: GET / HTTP/1.1
Host: google.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close
Content-Type: text/html
Content-Length: 0
Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 554
processHeaders: Starts
processHeaders: Result Code is 301
processHeaders: Array Follows
Array
(
    [Location] => http://www.google.com/
    [Set-Cookie] => PREF=ID=cbefc888fd113739:TM=1194917676:LM=1194917676:S=BEN0G7hklu18yy-o; expires=Thu, 12-Nov-2009 01:34:36 GMT; path=/; domain=.google.com
    [Content-Type] => text/html
    [Server] => gws
    [Content-Length] => 219
    [Date] => Tue, 13 Nov 2007 01:34:36 GMT
    [Connection] => Close
)
reactToResultCode: Redirect To http://www.google.com/
Execute: Successful Retrieve
postProcess: Content length is 219
Default onSuccess()
Default afterExecute()
Execute: Completes
Here is the latest class:
<?php // Code moved again to lower post ?>
perkiset
Quote from nutballs: cool cool it worked. I successfully ripped your blog
Right on dood! Wait, I mean, fish off!!
vsloathe
Awesome, it handles redirects. Also Perk how about the issue I was having with HTTPS? Propeller is a bitch that way.
perkiset
thanks VS... gonna see about HTTPS soon, as well as auto-cookies and pipelining, perhaps today if I have time.
perkiset
Quote from vsloathe: ... handles redirects ...
Well, it REPORTS redirects; it's up to you to go to them.
Quote from vsloathe: ... with HTTPS?
See the next post...
perkiset
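Because the class only reports a redirect (through resultCode and redirect) and leaves the follow-up to the caller, a caller-side loop is needed to chase one. The sketch below is one way to do that, with a hop limit and a loop check. `followRedirects()` is a made-up helper, and `$fetch` is a stand-in callable that returns canned responses so the example runs offline; with the real class it would wrap a call such as simpleGet() and read resultCode and redirect afterwards.

```php
<?php
// Hypothetical helper. $fetch takes a URL and returns
// array(resultCode, redirectTarget, content).
function followRedirects($fetch, $url, $maxHops = 5)
{
    $seen = array();
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (isset($seen[$url])) {
            return false; // redirect loop
        }
        $seen[$url] = true;
        list($code, $redirect, $content) = call_user_func($fetch, $url);
        if (($code == 301 || $code == 302) && $redirect != '') {
            $url = $redirect; // go around again with the new target
            continue;
        }
        return array('url' => $url, 'code' => $code, 'content' => $content);
    }
    return false; // too many hops
}

// Canned responses standing in for real requests.
$fake = function ($url) {
    $pages = array(
        'http://google.com/'     => array(301, 'http://www.google.com/', ''),
        'http://www.google.com/' => array(200, '', '<html>home</html>'),
    );
    return isset($pages[$url]) ? $pages[$url] : array(404, '', '');
};

$result = followRedirects($fake, 'http://google.com/');
echo $result['url'], ' ', $result['code'], "\n"; // prints "http://www.google.com/ 200"
```

This sketch assumes the Location value is an absolute URL, as in the Google log above; a relative Location would have to be resolved against the current URL first.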
<>Updates>
Here is the current class code:
<?php
class webRequest2 {
    private $socket;

    protected $ancient;
    protected $finalURL;
    protected $rawContent;
    protected $rawHeader;
    protected $rawResponse;
    protected $chunkedLength;
    protected $chunkedTransfer;
    protected $cookies;
    protected $cookieStr;
    protected $errorFlag;
    protected $getList;
    protected $headers;
    protected $postList;
    protected $postStr;

    public $accept;
    public $charSet;
    public $domain;
    public $debugLogFile;
    public $debugLogClearOnDispatch;
    public $debugMode;
    public $language;
    public $manualPostContent;
    public $method;
    public $port;
    public $redirect;
    public $resultCode;
    public $timeout;
    public $url;
    public $userAgent;
    public $useSSL;

    // Protected and special functions
    // PHP 5+ constructor (the visibility keywords above already require PHP 5)
    function __construct() {
        $this->reset();

        preg_match('/^([0-9])/', phpversion(), $parts);
        $this->ancient = ($parts[1] < '5');

        $this->userAgent = 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8';
        $this->accept = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
        $this->charSet = 'ISO-8859-1,utf-8;q=0.7,*;q=0.7';
        $this->language = 'en-us,en;q=0.5';

        if (!defined('WRD_OFF')) {
            define('WRD_OFF', 0);
            define('WRD_ECHO', 1);
            define('WRD_LOG', 2);
        }

        $this->debugMode = WRD_OFF;
        $this->debugLogFile = '';
        $this->debugLogClearOnDispatch = true;
        $this->timeout = 30;
        $this->useSSL = false;
    }

    protected function buildCookieStr() {
        $cookieStr = '';
        $start = true;
        foreach ($this->cookies as $name => $value) {
            if (!$start) { $cookieStr .= '; '; }
            $cookieStr .= "$name=$value";
            $start = false;
        }
        $this->debug("Built COOKIE String: $cookieStr");
        return $cookieStr;
    }

    protected function buildGetStr() {
        $getStr = '';
        $getCount = count($this->getList);
        if ($getCount) {
            $sepStr = '?';
            foreach ($this->getList as $name => $value) {
                $value = urlencode($value);
                $getStr .= "$sepStr$name=$value";
                $sepStr = '&';
            }
        }
        $this->debug("Built GET String: $getStr");
        return $getStr;
    }

    protected function buildPostStr() {
        if ($this->manualPostContent) return $this->manualPostContent;

        $postStr = '';
        $postCount = count($this->postList);
        if ($postCount) {
            $sepStr = '';
            foreach ($this->postList as $name => $value) {
                $value = urlencode($value);
                $postStr .= "$sepStr$name=$value";
                $sepStr = '&';
            }
        } else {
            $postStr = 'No Content';
        }
        $this->debug("Built POST String: $postStr");
        return $postStr;
    }

    protected function buildHeader() {
        if ($this->method == 'GET') {
            $header  = "GET {$this->finalURL} HTTP/1.1\r\n";
            $header .= "Host: {$this->domain}\r\n";
            $header .= "User-Agent: {$this->userAgent}\r\n";
            $header .= "Accept: {$this->accept}\r\n";
            $header .= "Accept-Language: {$this->language}\r\n";
            $header .= "Accept-Encoding: \r\n";
            $header .= "Accept-Charset: {$this->charSet}\r\n";
            if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
            $header .= "Connection: close\r\n";
            $header .= "Content-Type: text/html\r\n";
            // The blank line (second CRLF) terminates the header block
            $header .= "Content-Length: 0\r\n\r\n";
        } else {
            $postData = $this->buildPostStr();
            $requestLen = strlen($postData);

            $header  = "POST {$this->finalURL} HTTP/1.1\r\n";
            $header .= "Host: {$this->domain}\r\n";
            $header .= "User-Agent: {$this->userAgent}\r\n";
            $header .= "Accept: {$this->accept}\r\n";
            $header .= "Accept-Language: {$this->language}\r\n";
            $header .= "Accept-Encoding: \r\n";
            $header .= "Accept-Charset: {$this->charSet}\r\n";
            if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
            $header .= "Connection: close\r\n";
            $header .= "Content-Type: application/x-www-form-urlencoded\r\n";
            $header .= "Content-Length: $requestLen\r\n\r\n";
            $header .= $postData;
        }
        $this->debug("Outbound Header: $header");
        return $header;
    }

    protected function buildURL() {
        $this->finalURL = "{$this->url}{$this->buildGetStr()}";
        $this->debug("FinalURL: {$this->finalURL}");
    }

    protected function clearDebugLog() {
        if ($this->debugMode == WRD_LOG) {
            if (!$this->debugLogFile)
                throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
            if (file_put_contents($this->debugLogFile, '') === false)
                throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
        }
    }

    protected function debug($msg) {
        switch ($this->debugMode) {
            case WRD_OFF:
                return;
            case WRD_ECHO:
                echo "$msg\n";
                break;
            case WRD_LOG:
                if (!$this->debugLogFile)
                    throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
                if (file_put_contents($this->debugLogFile, "$msg\n", FILE_APPEND) === false)
                    throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
        }
    }

    protected function execute($theHeader) {
        $this->beforeExecute();
        $this->debug('Execute: Starts');

        $this->clearHeaders();
        $this->rawResponse = '';
        $this->rawHeader = '';
        $this->rawContent = '';
        $this->chunkedTransfer = false;
        $this->chunkedLength = 0;
        $this->errorFlag = false;
        $this->resultCode = 0;
        $this->redirect = '';

        $sslStr = ($this->useSSL) ? 'ssl://' : '';
        $this->socket = @fsockopen("$sslStr{$this->domain}", $this->port, $errno, $errstr);
        if (!$this->socket) {
            $this->debug('Execute: Cannot open socket');
            $this->onFailure();
            $this->afterExecute();
            return false;
        }

        $this->debug('Execute: Sending request');
        // Send the request exactly as built: trimming it would strip the
        // blank line that terminates the header block.
        $bytesToSend = strlen($theHeader);
        if (($bytesSent = fwrite($this->socket, $theHeader)) != $bytesToSend) {
            $this->debug("Execute: Failed - Only $bytesSent of $bytesToSend sent");
            $this->onFailure();
            $this->afterExecute();
            return false;
        }
        $this->debug('Execute: Request Sent');

        $this->rawResponse = $this->getChunk();

        // Split the response at the first blank line: header block, then content
        if (preg_match('/^(.*)\r\n\r\n/smU', $this->rawResponse, $parts)) $this->rawHeader = $parts[1];
        if (preg_match('/\r\n\r\n(.*)/sm', $this->rawResponse, $parts)) $this->rawContent = $parts[1];

        $this->processHeaders();
        $this->reactToResultCode();

        if ($this->chunkedTransfer) {
            $receivedSoFar = strlen($this->rawContent);
            while ($receivedSoFar < $this->chunkedLength) {
                $this->debug('Execute: Getting Chunked Block');
                $this->rawContent .= $this->getChunk();
                $receivedSoFar = strlen($this->rawContent);
                if ($this->errorFlag) {
                    $this->debug('Execute: Terminating');
                    $this->onFailure();
                    $this->afterExecute();
                    return false;
                }
            }
        }

        $this->debug('Execute: Successful Retrieve');
        $this->debug('postProcess: Content length is ' . strlen($this->rawContent));
        $this->onSuccess();
        $this->afterExecute();
        $this->debug('Execute: Completes');
        return $this->resultCode;
    }

    protected function getChunk() {
        $packets = array();
        $this->debug('GetChunk: Starts');
        while (!feof($this->socket)) {
            if (!$this->ancient) { stream_set_timeout($this->socket, $this->timeout); }
            $thisBuff = fread($this->socket, 65535);
            if (!$this->ancient) {
                $info = stream_get_meta_data($this->socket);
                if ($info['timed_out']) {
                    $this->debug('getChunk: Timed Out');
                    $this->errorFlag = true;
                    break;
                }
            }
            $this->debug('GetChunk: Received ' . strlen($thisBuff));
            $packets[] = $thisBuff;
        }
        return implode('', $packets);
    }

    protected function hasCookies() { return count($this->cookies); }

    protected function processHeaders() {
        $this->debug('processHeaders: Starts');
        $tempArr = explode("\r\n", $this->rawHeader);
        $ptr = 0;
        foreach ($tempArr as $line) {
            if ($ptr == 0) {
                // Zeroth line - not a valid header, get the result code:
                preg_match('/([0-9]{3})/', $line, $parts);
                $this->resultCode = isset($parts[1]) ? $parts[1] : 0;
                $ptr++;
                $this->debug("processHeaders: Result Code is {$this->resultCode}");
            } else {
                $parts = explode(': ', $line, 2);
                $this->headers[$parts[0]] = isset($parts[1]) ? $parts[1] : '';
            }
        }
        $this->debug("processHeaders: Array Follows\n" . print_r($this->headers, true));

        $this->chunkedTransfer = preg_match('/: chunked/i', $this->rawHeader);
        if ($this->chunkedTransfer) {
            // OK - the actual content length is now going to be the first line of the content... grab it and ditch it...
            if (preg_match('/^([^\r\n]+)\r\n(.*)/s', $this->rawContent, $parts)) {
                $this->chunkedLength = hexdec($parts[1]);
                $this->rawContent = $parts[2];
            }
            $this->debug("processHeaders: Chunked Transfer - expected length is {$this->chunkedLength}");
        }

        // If there are cookies, pull them into my cookie array...
        if (!empty($this->headers['Set-Cookie'])) {
            $temp = explode(';', $this->headers['Set-Cookie']);
            foreach ($temp as $line) {
                $parts = explode('=', $line);
                $name = trim($parts[0]);
                $value = isset($parts[1]) ? urldecode(trim($parts[1])) : '';
                $this->cookies[$name] = $value;
            }
            $this->debug("processHeaders: Cookie Array Follows\n" . print_r($this->cookies, true));
        }
    }

    protected function reactToResultCode() {
        // This function should be extended in the future to handle more eventualities
        switch ($this->resultCode) {
            case 200:
                break;
            case 301:
            case 302:
                $this->redirect = $this->headers['Location'];
                $this->debug("reactToResultCode: Redirect To {$this->redirect}");
                break;
            case 404:
                break;
        }
    }

    // Protected functions, designed to be overridden:
    protected function afterExecute() {
        $this->debug('Default afterExecute()');
        // Remember to call this function if you override it...
        if ($this->socket) fclose($this->socket);
    }
    protected function beforeExecute() { $this->debug('Default beforeExecute()'); }
    protected function onFailure() { $this->debug('Default onFailure()'); }
    protected function onSuccess() { $this->debug('Default onSuccess()'); }

    // Public functions
    function addGetParam($varName, $varValue) {
        $this->getList[trim($varName)] = $varValue;
        $this->debug("Adding GET Param: [$varName] = [$varValue]");
    }
    function addPostParam($varName, $varValue) {
        $this->postList[trim($varName)] = $varValue;
        $this->debug("Adding POST Param: [$varName] = [$varValue]");
    }
    function clearCookies() { $this->cookies = array(); }
    function clearGetParams() { $this->getList = array(); }
    function clearHeaders() { $this->headers = array(); }
    function clearPostParams() { $this->postList = array(); }

    function dispatch() {
        if ($this->debugLogClearOnDispatch) { $this->clearDebugLog(); }
        $this->debug('Dispatch Starts');
        $this->debug("Method: {$this->method}");
        $this->buildURL();
        $req = $this->buildHeader();
        return $this->execute($req);
    }

    function getContent() { return $this->rawContent; }
    function getCookie($cookieName) { return isset($this->cookies[$cookieName]) ? $this->cookies[$cookieName] : null; }
    function getCookies() { return $this->cookies; }
    function getHeader($headerName) { return isset($this->headers[$headerName]) ? $this->headers[$headerName] : null; }
    function getHeaders() { return $this->headers; }

    function getLinksAll() {
        $regex = '~<\s*a\s+href\s*=\s*[\'"]([^\'"]*)~i';
        preg_match_all($regex, $this->getContent(), $matches);
        return $matches[1];
    }

    function getLinksExternal() {
        $out = array();
        $temp = $this->getLinksAll();
        foreach ($temp as $link) {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (!preg_match("~^https?://{$this->domain}~", $link) and (preg_match('/^http/', $link)))
                $out[] = $link;
        }
        return $out;
    }

    function getLinksInternal() {
        $out = array();
        $temp = $this->getLinksAll();
        foreach ($temp as $link) {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (preg_match("~^https?://{$this->domain}~", $link) or (!preg_match('/^http/', $link))) {
                // Kill simple on-page positioning links (checked first, before
                // the link gets a directory prefixed to it)...
                if (substr($link, 0, 1) == '#') continue;

                // OK - it is an internal link, but let's do a little bit to it to help out the caller...
                // If the URL has http:// and no port, then kill the front end of it:
                if (substr(strtolower($link), 0, 5) == 'http:') {
                    if (!preg_match('~:[0-9]{1,6}/~', $link)) {
                        preg_match("~{$this->domain}(.*)~", $link, $matches);
                        $link = $matches[1];
                    }
                }

                // If it is a relative link, then we need to take the current directory and add it
                // on to the front end so that the link is absolute:
                if (substr($link, 0, 1) != '/') {
                    // Get the current directory from the original url...
                    if (!preg_match('~^(/.*/)[^/]*$~', (string) $this->url, $matches)) {
                        // There was no directory - the last send was the root, and the URL is
                        // relative to the root...
                        $link = "/$link";
                    } else {
                        $link = "{$matches[1]}$link";
                    }
                }

                $out[] = $link;
            }
        }
        return $out;
    }

    function getRawHeader() { return $this->rawHeader; }
    function getRawResponse() { return $this->rawResponse; }

    function reset() {
        $this->rawPostData = '';
        $this->url = '';
        $this->domain = '';
        $this->port = 80;
        $this->method = 'GET';
        $this->getArray['__count'] = -1;
        $this->postArray['__count'] = -1;
        $this->rawResponse = '';
        $this->rawHeader = '';
        $this->rawContent = '';
        $this->cookieStr = '';
        $this->postStr = '';
        $this->manualPostContent = '';
        $this->clearCookies();
        $this->clearGetParams();
        $this->clearHeaders();
        $this->clearPostParams();
    }

    function setCookie($cookieName, $cookieValue) { $this->cookies[$cookieName] = $cookieValue; }

    function simpleGet($completeURL) {
        $this->debug("simpleGet: Starts with [$completeURL]");
        if (substr(strtolower($completeURL), 0, 5) == 'https') {
            $this->debug("simpleGet: Using SSL");
            $this->useSSL = true;
            $this->port = 443;
            preg_match('~//(.*)~', $completeURL, $parts);
            $completeURL = $parts[1];
        } else if (substr(strtolower($completeURL), 0, 4) == 'http') {
            $this->useSSL = false;
            $this->port = 80;
            preg_match('~//(.*)~', $completeURL, $parts);
            $completeURL = $parts[1];
        }

        preg_match('~([^/]+)(.*)~', $completeURL, $parts);
        $this->domain = $parts[1];
        $this->url = $parts[2];
        // A bare domain has no path: request the root
        if ($this->url == '') $this->url = '/';
        $this->finalURL = $this->url;

        $req = $this->buildHeader();
        if ($this->execute($req)) { return $this->rawContent; }
        $this->debug('simpleGet Failed');
        return false;
    }
}
?>
emonk
Wow mayne. That's some pretty code. Almost makes me want to move over to the OO darkside....
One suggestion that may or may not be appropriate would be a user agent randomizer to help prevent bannination on those long runs, but it might be better off this way as it's more generic. A sneaky way to do it if you wanted to would be to use IE Agents that look to have been infected with a zillion kinds of spyware by sticking some random gibberish into the UA where bonzai buddy and FunWeb usually like to hang out. nutballs
i think perk is just assuming this is a base class that handles the core functionalities of requesting a page from the tubes.
I know that I, for one, intend to add in a whole bunch of extra stuff, like an agent randomizer for example. Since perk has already parameterized everything anyway, you can do this simply by writing your own random agent generator, and then doing class->userAgent = randomagentmaker(); before you simpleGet or dispatch.
DangerMouse
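The random agent generator nutballs describes could be as small as the sketch below. `randomAgentMaker()` is a made-up function in the spirit of his randomagentmaker(), and the list holds a few sample user agent strings chosen only for illustration; in practice you would supply your own.

```php
<?php
// Hypothetical helper: pick one user agent string at random from a list.
function randomAgentMaker(array $agents)
{
    if (count($agents) === 0) {
        return ''; // nothing to choose from
    }
    return $agents[array_rand($agents)];
}

// Sample strings for illustration only.
$agents = array(
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
);

echo randomAgentMaker($agents), "\n";

// With the class from this thread, the assignment would go before the request:
// $req->userAgent = randomAgentMaker($agents);
```

Keeping the randomizer outside the class matches the point made in this exchange: the base class stays generic and the caller decides how sneaky to be.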
Some nice additions there!
All the help I'm getting from this place must be paying off, as when I replicated the previous webRequest class using CURL instead I've taken a similar approach! With regard to the splitting of internal and external links, I appreciate that it is unlikely, but is there the potential that the base href tag may be set to an external domain, making relative links external? I've gone round in circles with this one when creating a little spider class recently; the use of regex here makes it look so easy! DM
perkiset
Quote from nutballs: i think perk is just assuming this is a base class that handles the core functionalities of requesting a page from the tubes.
Spot on NBs - the last class was rife with customized horsepoop rather than trim, fast and targeted. I think this class, particularly as a base, is more reusable and robust.
Quote from DangerMouse: With regard to the splitting of internal and external links, I appreciate that it is unlikely, but is there the potential that the base href tag may be set to an external domain, making relative links external?
Absolutely, but as you say, pretty durn unlikely. Have you actually seen this in the wild? I suppose that I could look for a base href tag and then try to adjust accordingly, but that's pretty far out there for where I wanted to go with this class. My next thought would be a child class called "spider", where I would add a lot of that sort of thing. Frankly, the getLinks functions are already sort of razor's-edge to me, being somewhat outside the scope of a pure page requestor. From a hierarchy standpoint, I'd see myself creating different trees from this class, one branching towards spidering and the other branching towards scripted interaction with other sites.
dimitry12
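For anyone who does want to cover DangerMouse's base href case in a spider-style child class, the check could look roughly like the sketch below: find the base tag if there is one, and treat a relative link as belonging to that host rather than to the requested domain. Both function names are invented for this illustration, the regex is deliberately simple, and example.net is a placeholder host.

```php
<?php
// Hypothetical helper: return the href of the first <base> tag, or null.
function findBaseHref($html)
{
    if (preg_match('~<base\s[^>]*href\s*=\s*[\'"]([^\'"]+)~i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: decide whether a link points away from $requestDomain.
function isExternalLink($link, $requestDomain, $baseHref = null)
{
    $host = parse_url($link, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // Relative link: it belongs to the base href's host if there is one.
        $host = ($baseHref !== null) ? parse_url($baseHref, PHP_URL_HOST) : null;
    }
    if ($host === null || $host === false) {
        return false; // relative to the requested page itself
    }
    return strcasecmp($host, $requestDomain) !== 0;
}

$html = '<html><head><base href="http://cdn.example.net/"></head>'
      . '<body><a href="img/a.png">x</a></body></html>';
$base = findBaseHref($html);

echo isExternalLink('img/a.png', 'www.perkiset.org', $base) ? "external\n" : "internal\n";
echo isExternalLink('img/a.png', 'www.perkiset.org') ? "external\n" : "internal\n";
```

With the base tag taken into account the relative link is reported as external, and without it the same link is internal, which is exactly the difference DangerMouse was asking about.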
Could you please share what you intend to use the WebRequest class and its descendants for? (Presuming that content is not the king.)
DangerMouse
Quote from perkiset: Have you actually seen this in the wild?
Nah, haven't seen it anywhere, not really experienced enough to judge though. Love the sound of where you're going with this - keeps me inspired to continue chipping away at my own little projects. DM
perkiset
Quote from dimitry12: Could you please share your intentions what you will use WebRequest class and its descendants to? (Presuming that content is not the king)
Well, my first intentions are simply to shore up existing processes that have become sluggish and/or dusty and broken. Primarily my spider processes and some scripted 'bots. But I also do a lot of B2B work and the method of communication is becoming less socket-level proprietary stuff and more webrequestish - so it only makes sense to refactor old code and bring it up to date. As to what I'm spidering, or what I'm scripting, I'd prefer not to say...
perkiset
<>Update>
I just checked the code for POST and there was a tiny bug, which I have fixed in the code a few posts up from here. If you copied the code earlier than the time stamp on this post then you'll want to grab it again, or you can patch line 137, which used to be: $header .= "Host: {$this->Host}\r\n"; In the fixed code that line uses {$this->domain} instead.
perkiset
<>Yet Another Update> <i>This update should not break any of your code using the class.</i>
<? phpclass webRequest2 { private $socket; protected $finalURL; protected $rawContent; protected $rawHeader; protected $rawResponse; protected $chunkedLength; protected $chunkedTransfer; protected $cookies; protected $cookieStr; protected $errorFlag; protected $getList; protected $headers; protected $postList; protected $postStr; public $accept; public $charSet; public $domain; public $debugLogFile; public $debugLogClearOnDispatch; public $debugMode; public $language; public $manualPostContent; public $method; public $port; public $postMode; public $proxy; public $redirect; public $resultCode; public $timeout; public $url; public $userAgent; public $useSSL; // Protected and special functions function webRequest2() { $this->reset(); preg_match('/^([0-9])/', phpversion(), $parts);$this->ancient = ($parts[1] < '5'); $this->userAgent = 'Mozilla/5.0 ( Macintosh; U; PPCMacOS X; en)AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8';$this->accept = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'; $this->charSet = 'ISO-8859-1,utf-8:q=0.7,*;q=0.7'; $this->language = 'en-us,en;q=0.5'; if (!defined('WRD_OFF')) { define('WRD_OFF', 0); define('WRD_ECHO', 1); define('WRD_LOG', 2); define('WRM_GET', 0); define('WRM_POST', 1); define('WRP_NORMAL', 0); define('WRP_MULTIPART', 1); } $this->debugMode = WRD_OFF; $this->postMode = WRP_NORMAL; $this->debugLogFile = ''; $this->debugLogClearOnDispatch = true; $this->timeout = 30; $this->useSSL = false; $this->proxy = ''; } protected function buildCookieStr() { $cookieStr = ''; $start = true; foreach($this->cookies as $name=>$value) { if (!$start) { $cookieStr .= '; '; } $cookieStr .= "$name=$value"; $start = false; } $this->debug("Built COOKIE String: $cookieStr"); return $cookieStr; } protected function buildGetStr() { $getStr = ''; $getCount = count($this->getList); if ($getCount) { $sepStr = '?'; foreach($this->getList as $name=>$value) { $value = urlencode($value); $getStr .= 
"$sepStr$name=$value"; $sepStr = '&'; } } $this->debug("Built GET String: $getStr"); return $getStr; } protected function buildPostStr() { if ($this->manualPostContent) return $this->manualPostContent; $postStr = ''; $postCount = count($this->postList); if ($postCount) { $sepStr = ''; foreach($this->postList as $name=>$arr) { $value = urlencode($arr['content']); $postStr .= "$sepStr$name=$value"; $sepStr = '&'; } } else { $postStr = 'No Content'; } $this->debug("Built POST String: $cookieStr"); return $postStr; } protected function buildHeader() { $header[0] = ''; // place holder for first line of header $header[] = "Host: {$this->domain}"; $header[] = "User-Agent: {$this->userAgent}"; $header[] = "Accept: {$this->accept}"; $header[] = "Accept-Language: {$this->language}"; $header[] = "Accept-Encoding: "; $header[] = "Accept-Charset: {$this->charSet}"; if ($this->hasCookies()) { $header[] = "Cookie: {$this->buildCookieStr()}"; } $header[] = "Connection: close"; $hostStr = ($this->proxy) ? "http://{$this->domain}" : ''; switch($this->method) { case 'get': case 'GET': $header[0] = "GET $hostStr{$this->finalURL} HTTP/1.1"; $header[] = ''; $header[] = "Content-Type: text/html"; $header[] = "Content-Length: 0"; $header[] = ''; break; case 'post': case 'POST': if (count($this->postList) == 0) $this->postMode = WRP_NORMAL; $header[0] = "POST $hostStr{$this->finalURL} HTTP/1.1"; switch ($this->postMode) { case WRP_NORMAL: $postData = $this->buildPostStr(); $requestLen = strlen($postData); $header[] = "Content-Type: application/x-www-form-urlencoded"; $header[] = "Content-Length: $requestLen"; $header[] = ''; $header[] = $postData; break; case WRP_MULTIPART: $boundary = time() . 
time(); $postData = $this->buildMultipartPostStr($boundary); $requestLen = strlen($postData); $header[] = "Content-Type: multipart/form-data; boundary=$boundary"; $header[] = "Content-Length: $requestLen"; $header[] = ''; $header[] = "$postData"; break; default: $this->debug("buildHeader: Terminal failure - unknown postMode '{$this->postMode}'"); throw new Exception("buildHeader: Terminal failure - unknown postMode '{$this->postMode}'"); break; } break; default: $this->debug("buildHeader: Terminal failure - unknown method '{$this->method}'"); throw new Exception("buildHeader: Terminal failure - unknown method '{$this->method}'"); break; } $out = implode(" ", $header); $this->debug("Outbound Header: $out"); return $out; } protected function buildMultipartPostStr($boundary) { $out = array(); foreach($this->postList as $name=>$arr) { $value = $arr['content']; $type = $arr['type']; $out[] = "--$boundary"; $out[] = "Content-Disposition: form-data; name="$name""; $out[] = "Content-type: $type"; $out[] = ''; $out[] = "$value"; } $out[] = "--$boundary--"; return implode(" ", $out); } protected function buildURL() { $this->finalURL = "{$this->url}{$this->buildGetStr()}"; $this->debug("FinalURL: {$this->finalURL}"); } protected function clearDebugLog() { if ($this->debugMode == WRD_LOG) { if (!$this->debugLogFile) throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set'); if (file_put_contents($this->debugLogFile, '') === false) throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to'); } } protected function debug($msg) { switch($this->debugMode) { case WRD_OFF: return; case WRD_ECHO: echo "$msg "; break; case WRD_LOG: if (!$this->debugLogFile) throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set'); if (file_put_contents($this->debugLogFile, "$msg ", FILE_APPEND) == false) throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written 
to'); } } protected function execute($theHeader) { $this->beforeExecute(); $this->debug('Execute: Starts'); $this->clearHeaders(); $this->rawResponse = ''; $this->rawHeaders = ''; $this->rawContent = ''; $this->transferChunked = false; $this->chunkedLength = 0; $this->errorFlag = false; $this->resultCode = 0; $this->redirect = ''; $sslStr = ($this->useSSL) ? 'ssl://' : ''; $hostStr = ($this->proxy) ? "$sslStr{$this->proxy}" : "$sslStr{$this->domain}"; $this->debug("Execute: HostStr=[{$hostStr}] Port:{$this->port}"); $this->socket = @fsockopen($hostStr, $this->port, $errno, $errstr); if (!$this->socket) { $this->debug('Execute: Cannot open socket'); $this->onFailure(); $this->afterExecute(); return false; } $this->debug('Execute: Sending request'); $bytesToSend = strlen(trim($theHeader)); if (($bytesSent = fwrite($this->socket, trim($theHeader))) <> $bytesToSend) { $this->debug("Execute: Failed - Only $bytesSent of $bytesToSend sent"); $this->onFailure(); $this->afterExecute(); return false; } $this->debug('Execute: Request Sent'); $this->rawResponse = $this->getChunk(); preg_match('/^(.*) /smU', $this->rawResponse, $parts); $this->rawHeader = $parts[1]; preg_match('/ (.*)/sm', $this->rawResponse, $parts); $this->rawContent = $parts[1]; $this->processHeaders(); $this->reactToResultCode(); if ($this->chunkedTransfer) { $receivedSoFar = strlen($this->rawContent); while ($receivedSoFar < $this->chunkedLength) { $this->debug('Execute: Getting Chunked Block'); $this->rawContent .= $this->getChunk(); $receivedSoFar = strlen($this->rawContent); if ($this->errorFlag) { $this->debug('Execute: Terminating'); $this->onFailure(); $this->afterExecute(); return false; } } } $this->debug('Execute: Successful Retrieve'); $this->debug('postProcess: Content length is ' . 
            strlen($this->rawContent));
        $this->onSuccess();
        $this->afterExecute();
        $this->debug('Execute: Completes');
        return $this->resultCode;
    }

    protected function getChunk() {
        $packets = array();
        $this->debug('GetChunk: Starts');
        while (!feof($this->socket)) {
            // The stream timeout functions are skipped when running on an ancient PHP
            if (!$this->ancient) { stream_set_timeout($this->socket, $this->timeout); }
            $thisBuff = fread($this->socket, 65535);
            if (!$this->ancient) {
                $info = stream_get_meta_data($this->socket);
                if ($info['timed_out']) {
                    $this->debug('getChunk: Timed Out');
                    $this->errorFlag = true;
                    break;
                }
            }
            $this->debug('GetChunk: Received ' . strlen($thisBuff));
            $packets[] = $thisBuff;
        }
        return implode('', $packets);
    }

    protected function hasCookies() { return count($this->cookies); }

    protected function processHeaders() {
        $this->debug('processHeaders: Starts');
        $tempArr = explode("\r\n", $this->rawHeader);
        $ptr = 0;
        foreach($tempArr as $line) {
            if ($ptr == 0) {
                // Zeroth line - not a valid header, get the result code:
                preg_match('/([0-9]{3})/', $line, $parts);
                $this->resultCode = isset($parts[1]) ? $parts[1] : 0;
                $ptr++;
                $this->debug("processHeaders: Result Code is {$this->resultCode}");
            } else {
                // Limit the split to 2 so values that contain ': ' stay intact
                $parts = explode(': ', $line, 2);
                if (count($parts) == 2) $this->headers[$parts[0]] = $parts[1];
            }
        }
        $this->debug("processHeaders: Array Follows\n" . print_r($this->headers, true));

        $this->chunkedTransfer = preg_match('/: chunked/i', $this->rawHeader);
        if ($this->chunkedTransfer) {
            // OK - the actual content length is now going to be the first line of the content... grab it and ditch it...
            if (preg_match('/([^\r\n]+)\r\n(.*)/ms', $this->rawContent, $parts)) {
                $this->chunkedLength = hexdec($parts[1]);
                $this->rawContent = $parts[2];
            }
            $this->debug("processHeaders: Chunked Transfer - expected length is {$this->chunkedLength}");
        }

        // If there are cookies, pull them into my cookie array...
        if (!empty($this->headers['Set-Cookie'])) {
            $temp = explode(';', $this->headers['Set-Cookie']);
            foreach($temp as $line) {
                // Limit the split to 2 so values that contain '=' stay intact
                $parts = explode('=', $line, 2);
                $name = trim($parts[0]);
                $value = isset($parts[1]) ? urldecode(trim($parts[1])) : '';
                $this->cookies[$name] = $value;
            }
            $this->debug("processHeaders: Cookie Array Follows\n" . print_r($this->cookies, true));
        }
    }

    protected function reactToResultCode() {
        // This function should be extended in the future to handle more eventualities
        switch($this->resultCode) {
            case 200:
                break;
            case 301:
            case 302:
                $this->redirect = $this->headers['Location'];
                $this->debug("reactToResultCode: Redirect To {$this->redirect}");
                break;
            case 404:
                break;
        }
    }

    // Protected functions, designed to be overridden:
    protected function afterExecute() {
        $this->debug('Default afterExecute()');
        // Remember to call this function if you override it...
        if ($this->socket) fclose($this->socket);
    }
    protected function beforeExecute() { $this->debug('Default beforeExecute()'); }
    protected function onFailure() { $this->debug('Default onFailure()'); }
    protected function onSuccess() { $this->debug('Default onSuccess()'); }

    // Public functions
    function addGetParam($varName, $varValue) {
        $this->getList[trim($varName)] = $varValue;
        $this->debug("Adding GET Param: [$varName] = [$varValue]");
    }

    function addPostParam($varName, $varValue, $type='text/plain') {
        $varName = trim($varName);
        $this->postList[$varName]['content'] = $varValue;
        $this->postList[$varName]['type'] = $type;
        $this->debug("Adding POST Param: [$varName] = [$varValue]");
    }

    function clearCookies() { $this->cookies = array(); }
    function clearGetParams() { $this->getList = array(); }
    function clearHeaders() { $this->headers = array(); }
    function clearPostParams() { $this->postList = array(); }

    function dispatch() {
        if ($this->debugLogClearOnDispatch) { $this->clearDebugLog(); }
        $this->debug('Dispatch Starts');
        $this->debug("Method: {$this->method}");
        $this->buildURL();
        $req = $this->buildHeader();
        return $this->execute($req);
    }

    function getContent() { return $this->rawContent; }
    function getCookie($cookieName) { return $this->cookies[$cookieName]; }
    function getCookies() { return $this->cookies; }
    function getHeader($headerName) { return $this->headers[$headerName]; }
    function getHeaders() { return $this->headers; }

    function getLinksAll() {
        // The heredoc body and its closing marker must start at column 0
        $regex = <<<REGEX
~<[\s]*a[\s]+href[\s]*=[\s]*['"]([^'"]*)~i
REGEX;
        preg_match_all($regex, $this->getContent(), $matches);
        return $matches[1];
    }

    function getLinksExternal() {
        $out = array();
        $temp = $this->getLinksAll();
        foreach($temp as $link) {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (!preg_match("~^http[s]*://{$this->domain}~", $link) and (preg_match('/^http/', $link)))
                $out[] = $link;
        }
        return $out;
    }

    function getLinksInternal() {
        $out = array();
        $temp = $this->getLinksAll();
        foreach($temp as $link) {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (preg_match("~^http[s]*://{$this->domain}~", $link) or (!preg_match('/^http/', $link))) {
                // OK - it is an internal link, but let's do a little bit to it to help out the caller...
                // If the URL has http:// and no port, then kill the front end of it:
                if (substr(strtolower($link), 0, 5) == 'http:') {
                    if (!preg_match('~:[0-9]{1,6}/~', $link)) {
                        preg_match("~{$this->domain}(.*)~", $link, $matches);
                        $link = $matches[1];
                    }
                }

                // If it is a relative link, then we need to take the current directory and add it
                // on to the front end so that the link is absolute. On-page '#' links are left
                // alone here so that the check further down can still drop them:
                $first = substr($link, 0, 1);
                if ($first <> '/' and $first <> '#') {
                    // Get the current directory from the original url...
                    preg_match("~{$this->domain}(/.*/)[^/]*~", $link, $matches);
                    if (!$matches) {
                        // There was no directory - the last send was the root, and the URL is
                        // relative to the root...
                        $link = "/$link";
                    } else {
                        $link = "{$matches[1]}$link";
                    }
                }

                // Kill simple on-page positioning links...
                if (substr($link, 0, 1) == '#') continue;
                $out[] = $link;
            }
        }
        return $out;
    }

    function getRawHeader() { return $this->rawHeader; }
    function getRawResponse() { return $this->rawResponse; }

    function reset() {
        $this->rawPostData = ''; |
