perkiset

I’ve taken a look at my old code and its clunky nature. I have also been considering the “pipelining” thread, as well as error handling and funkier packets returned from the server. With that, I have a new web request class, cleverly named “webRequest2.” It allows for user agent spoofing as well as different encodings and languages. It is also more version-resilient, being able to run in PHP 4 and less, but taking advantage of the timeout feature of PHP 5 as well. NBs, I’ve also added a simpleGet function for you.

Additionally, the old class had a lot of crap in it for old faulty servers I was dealing with - success on timeouts, considering a page completed if I ever saw the "</body>" tag ... stuff like that. I think that these things would be better placed in a derivative class than in the fundamental class as I had - so this one is much leaner from that perspective.

I have not implemented pipelining, nor looked into VSloathe's notion of multipart packets, but the way it is built these should be minor mods.

Properties

  • accept - defaults to text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

  • charSet - defaults to ISO-8859-1,utf-8:q=0.7,*;q=0.7

  • domain - the host of where you want to send requests ie., www.perkiset.org
  • debugLogFile - Only required if you set the debug mode to WRD_LOG – this is the name of the file that will be used for logging.

  • debugLogClearOnDispatch - If debug logging, this will clear the file on every new dispatch.

  • debugMode - Defaults to WRD_OFF, if WRD_ECHO then the debug lines are spit out as they are created and if WRD_LOG then debug lines are written quietly to the debugLogFile.

  • language - Defaults to en-us,en;q=0.5

  • manualPostContent - If you want to post something and do not want the class to construct the post data for you (ie., a custom job or special data) then put it here and the class will use it instead of building a new one at dispatch.

  • method - GET or POST – currently no checking here, watch yourself! Defaults to GET.

  • port - Defaults to 80

  • timeout - Defaults to 30 seconds. Only useful if you have PHP 5 – this will allow the stream to time out and return a failure if the amount of time you have allotted for download has been exceeded.

  • url - The simple URL (no get parameters) of where you want to go ie., /index.html

  • userAgent - Defaults to Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8



Methods

  • addGetParam($varName, $varValue) – add a parameter that will be appended to the URL as a get parameter

  • addPostParam($varName, $varValue) – add a value that will be sent up in the POST CONTENT portion of the request

  • clearCookies() – clear the cookies array

  • clearGetParams() – clear the get parameters array

  • clearPostParams() – clear the post parameters array

  • dispatch() – execute the request and return T/F if successful

  • getContent() – return the content portion of the last request

  • getCookie($varName) – return the value of cookie ($varName)

  • getCookies() – return a handle to the cookies array

  • getHeader($varName) – return the value of header ($varName)

  • getHeaders() – return a handle to the headers array

  • getRawHeader() – return the raw header text

  • getRawResponse() – return the entire unparsed response

  • reset() – reset all internal variables EXCEPT for accept, charSet, language and userAgent.

  • setCookie($varName, $varValue) – set cookie(varName) to varValue

  • simpleGet($completeURL) – Pass a complete URL like http://myDomain.com/search.php?str=viagra and I’ll use that without any of the processing normally handled in dispatch().



Usage

<?php

$req = new webRequest2();
$content = $req->simpleGet('http://www.perkiset.org/');

$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$content = $req->simpleGet('http://www.braindonkey.com/2007/11/10/the-road-record-cheated-out-of-my-millions/');
print_r($req->getHeaders());

$req = new webRequest2();
$req->domain = 'blogs.pcworld.com';
$req->url = '/staffblog/archives/005885.html';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
$req->dispatch();
echo $req->getRawResponse();

?>


Here is an example of debug output for the blog site that NBs used as an example in the old thread:

Dispatch Starts
Method: GET
Built GET String:
FinalURL: http://blogs.pcworld.com/staffblog/archives/005885.html
Outbound Header:
GET http://blogs.pcworld.com/staffblog/archives/005885.html HTTP/1.1
Host: blogs.pcworld.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close

Content-Type: text/html
Content-Length: 0

Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 1471
GetChunk: Received 2778
GetChunk: Received 627
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1968
GetChunk: Received 0
processHeaders: Starts
processHeaders: Array Follows
Array
(
    [HTTP/1.1 200 OK] =>
    [Date] => Mon, 12 Nov 2007 23:31:21 GMT
    [Server] => Apache/1.3.27 (Unix)
    [Connection] => close
    [Content-Type] => text/html
    [Vary] => Accept-Encoding
)

Execute: Successful Retrieve
postProcess: Content length is 40891
Default onSuccess()
Default afterExecute()
Execute: Completes



I have not yet had time to work through a solid testing suite for POST data. Also, I was using your server NBs for chunked testing and it spontaneously started sending the entire packet in one chunk rather than forcing me into many, so that symptom may rear its ugly head yet again. This class should be much better at handling packets like that than my previous class tho.
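For anyone testing against servers that do split the body into many chunks, here is a rough sketch of how a chunked body can be decoded once the whole response has been read. It is not the class's own code: decodeChunkedBody() is a hypothetical helper, and it assumes the complete body (everything after the blank line that ends the headers) is already in one string.

```php
<?php
// Hypothetical helper, not part of webRequest2: decode a chunked HTTP body.
// Assumes $raw holds the complete body as received from the socket.
function decodeChunkedBody($raw)
{
    $decoded = '';
    $pos = 0;
    $len = strlen($raw);
    while ($pos < $len) {
        // Each chunk starts with a hex size line terminated by CRLF.
        $lineEnd = strpos($raw, "\r\n", $pos);
        if ($lineEnd === false) {
            break; // incomplete data; stop rather than guess
        }
        $sizeLine = substr($raw, $pos, $lineEnd - $pos);
        // Ignore optional chunk extensions after a semicolon.
        $sizeParts = explode(';', $sizeLine);
        $size = hexdec(trim($sizeParts[0]));
        if ($size == 0) {
            break; // a zero-size chunk marks the end of the body
        }
        $decoded .= substr($raw, $lineEnd + 2, $size);
        // Skip past the chunk data and the CRLF that follows it.
        $pos = $lineEnd + 2 + $size + 2;
    }
    return $decoded;
}
```

The point of the loop is that every chunk carries its own length, so the first size line is only the length of the first chunk and not of the whole page.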

Feedback and bugs welcome, thanks!

/p


<?php

// CODE HAS BEEN MOVED TO NEXT POST WITH UPDATES.

?>

perkiset

First Update: NBs made a great request to add some events to the execute() - so there are 4 functions that you can override in a descendant class that might make sense for you:


  • afterExecute() - this function is called at the very end of execution, regardless of success or failure.

  • beforeExecute() - this function will be called at the very start of execute(), which is AFTER url preparation and such in the dispatch call.

  • onFailure() - this function is called in the event of a failed request (execute) after all default error processing is handled but before afterExecute()

  • onSuccess() - this function is called in the event of a successful request (execute) after everything has been completed but before afterExecute().
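To show the order in which the four events fire, here is a small self-contained sketch. hookDemoBase is a made-up stand-in that only copies the hook plumbing described above (it opens no socket), and countingRequest is a hypothetical descendant; with the real class you would extend webRequest2 instead.

```php
<?php
// Stand-in for webRequest2 with only the hook plumbing (hypothetical, for illustration).
class hookDemoBase
{
    public $willSucceed = true; // lets the demo fake a success or a failure

    protected function beforeExecute() {}
    protected function onSuccess() {}
    protected function onFailure() {}
    protected function afterExecute() {}

    public function execute()
    {
        $this->beforeExecute();
        if ($this->willSucceed) {
            $this->onSuccess();
        } else {
            $this->onFailure();
        }
        $this->afterExecute(); // always runs, success or failure
        return $this->willSucceed;
    }
}

// A descendant that records which events fired.
class countingRequest extends hookDemoBase
{
    public $events = array();

    protected function beforeExecute() { $this->events[] = 'before'; }
    protected function onSuccess()     { $this->events[] = 'success'; }
    protected function onFailure()     { $this->events[] = 'failure'; }

    protected function afterExecute()
    {
        $this->events[] = 'after';
        parent::afterExecute(); // keep the parent's cleanup when overriding
    }
}
```

Calling the parent in afterExecute() matters with the real class, since its default afterExecute() is where the socket gets closed.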



<?php

// Code has been moved to lower post with updates

?>

nop_90

very nice.
regardless of what language you use, good idea to study perks code so u can see how http protocol works.
(nothing like having live code to examine, rather than stupid specs)
i have a few ideas where perks code may become useful Applause.

perkiset

Thanks nop -

Now I have a question for anyone reading along... WTF is with WordPress?

If I set the permalinks structure to anything except "default" then the URLs that I send into the server fail... if I leave it at default then I get what I expect. For example, if I request "http://www.perkiset.org/politics/?p=29" I get the correct page - if I request "http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone" with the permalinks on I get a 404. Of course, they both work perfectly in a browser. Additionally, if I have permalinks on and ask for the parameterized URL in Safari, it still works correctly, but then 404s in the class.

I've just requested a bunch of stuff from a boatload of sites and the class is holding up nicely - except for permalinked WordPress. Any ideas?

perkiset

Thanks NBs for working the problem with me -

As it happened, my header was slightly sloppy and virtually everything could work with it, except WordPress's translation routines, which did not understand it. Not only did I fix the header but I patched the simpleGet function to make sure that it is correct as well.
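The post does not say exactly what was sloppy, but the debug logs in this thread show the request line changing from the full URL (GET http://blogs.pcworld.com/... HTTP/1.1) to just the path (GET / HTTP/1.1). As a hedged illustration of a tidy request, here is a minimal sketch; buildSimpleGetRequest() is a hypothetical helper and not the class's buildHeader().

```php
<?php
// Hypothetical helper: build a minimal HTTP/1.1 GET request.
function buildSimpleGetRequest($host, $path, $userAgent)
{
    if ($path === '') {
        $path = '/'; // the request line needs at least a slash
    }
    $lines = array(
        "GET $path HTTP/1.1",     // path only; the host goes in its own header
        "Host: $host",
        "User-Agent: $userAgent",
        "Connection: close",
    );
    // Every line ends with CRLF, and an empty line terminates the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}
```

Sending the path in the request line and the domain in the Host header is what a direct (non-proxy) request normally looks like; the absolute-URL form is the one used when talking to a proxy.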


<?php

// Code moved yet again to another lower post

?>

nutballs

cool cool it worked.

I successfully ripped your blog Applause

errr though you got some wierd chars in there., new thread about that...

perkiset

Another Update

  • Changed the result of dispatch() to either false if failed or the integer value of the response code ie., 200, 301, 404 etc.

  • Added 2 new properties, resultCode (which contains the same value as the return from dispatch) and redirect. If the response code is 301 or 302 then redirect is filled with the new URL. If you use simpleGet then you'll get a blank content, but you can check the resultCode to see what happened (example below)

  • Added a handler function to react to the response code which, at this time, simply fills the new property "redirect" with what the request should redirect to. This will certainly get larger as more folks weigh in here.

  • simpleGet now works correctly whether or not you put http: in front of the URL.
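On the last point, here is one way the URL splitting could be done with PHP's parse_url(). This is only a sketch under assumptions: splitUrlForRequest() and the keys of the array it returns are made up for illustration (they borrow the class's property names), and it also picks the port from the scheme, which goes a little beyond what the bullet describes.

```php
<?php
// Hypothetical helper: split a complete URL into the pieces a request needs.
function splitUrlForRequest($completeURL)
{
    // parse_url() needs a scheme to find the host, so assume http if it is missing.
    if (!preg_match('~^https?://~i', $completeURL)) {
        $completeURL = 'http://' . $completeURL;
    }
    $parts = parse_url($completeURL);
    $useSSL = (strtolower($parts['scheme']) == 'https');

    // An empty path means the site root.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }

    return array(
        'useSSL' => $useSSL,
        'domain' => $parts['host'],
        'port'   => isset($parts['port']) ? $parts['port'] : ($useSSL ? 443 : 80),
        'url'    => $path,
    );
}
```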



Completed code follows the examples.


<?php

// This is an example of using simpleGet and the result code...
$req = new webRequest2();
if (!$buff = $req->simpleGet('http://www.perkiset.org/forum/'))
{
switch($req->resultCode)
{
case 301:
case 302:
$newURL = $req->redirect;
break;
}
}

// This example will write a debug log and echo the result code of the get.
$req = new webRequest2();
$req->domain = 'www.perkiset.org';
$req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
echo $req->dispatch();

?>


Here is the debug log on a page that Google wants to redirect:

<?php
$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$req->simpleGet('http://google.com/');
?>

simpleGet: Starts with [http://google.com/]
Outbound Header:
GET / HTTP/1.1
Host: google.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close

Content-Type: text/html
Content-Length: 0

Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 554
processHeaders: Starts
processHeaders: Result Code is 301
processHeaders: Array Follows
Array
(
    [Location] => http://www.google.com/
    [Set-Cookie] => PREF=ID=cbefc888fd113739:TM=1194917676:LM=1194917676:S=BEN0G7hklu18yy-o; expires=Thu, 12-Nov-2009 01:34:36 GMT; path=/; domain=.google.com
    [Content-Type] => text/html
    [Server] => gws
    [Content-Length] => 219
    [Date] => Tue, 13 Nov 2007 01:34:36 GMT
    [Connection] => Close
)

reactToResultCode: Redirect To http://www.google.com/
Execute: Successful Retrieve
postProcess: Content length is 219
Default onSuccess()
Default afterExecute()
Execute: Completes


Here is the latest class:

<?php

// Code moved again to lower post

?>

perkiset

Quote from nutballs:

cool cool it worked.

I successfully ripped your blog Applause


Right on dood!

Wait, I mean, fish off!!  Applause Applause

vsloathe

Awesome, it handles redirects. Also Perk how about the issue I was having with HTTPS? Propeller is a bitch that way.

perkiset

thanks VS... gonna see about HTTPS soon, as well as auto-cookies and pipelining, perhaps today if I have time.

perkiset

Quote from vsloathe:

... handles redirects ...

Well, it REPORTS redirects, it's up to you to go to them.
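Going to them can be a short loop in the calling code. The sketch below is hypothetical and deliberately independent of the class: fetchFollowingRedirects() takes any callable that fetches a URL and returns an array with 'code', 'redirect' and 'content' keys (names made up for this example), and it assumes the redirect target is an absolute URL.

```php
<?php
// Hypothetical helper: follow 301/302 responses up to a limit.
// $fetch is any callable that takes a URL and returns
// array('code' => int, 'redirect' => string, 'content' => string).
function fetchFollowingRedirects($fetch, $url, $maxHops = 5)
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $result = call_user_func($fetch, $url);
        $isRedirect = ($result['code'] == 301 || $result['code'] == 302);
        if (!$isRedirect || $result['redirect'] == '') {
            $result['finalURL'] = $url; // where we actually ended up
            return $result;
        }
        $url = $result['redirect'];
    }
    return false; // too many hops, probably a redirect loop
}
```

With webRequest2 the callable could wrap simpleGet() and build the array from the resultCode and redirect properties. The hop limit is there because two pages redirecting to each other would otherwise spin forever.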

Quote from vsloathe:

... with HTTPS?

See the next post...

perkiset

Updates

  • HTTPS now supported in both normal mode and the simpleGet function. Either set the property useSSL to true (it's false by default) when using dispatch()  (you'll also need to set the port to 443) or simply pass a secure URL in the simpleGet function ie., simpleGet('https://www.adomain.com/afile.html') and the class will set useSSL and the port correctly.

  • Auto-cookies: Whenever a URL is retrieved and there are cookies in the header, they are automatically parsed into the cookies array - so if you went right back out to the same domain the cookies would be sent up automatically for you, like a surfer's browser would.

  • getLinksAll() this function will return all links on the scraped page in their on-page format ie., no formatting or modification is done to them.

  • getLinksExternal() - this function will return all links on the scraped page that point to a domain other than what was requested in the URL.

  • getLinksInternal() - this function gets all the internal links on the scraped page, and does a bit of work to them. If they are secure links, they are unchanged. If they have a different port number, they are unchanged. If they are a simple hard link (ie., http://www.thedomain.com/index.html) then the resulting link will be /index.html. If they are a relative URL, ie., "articles/myarticle.html" then the directory structure of the called URL will be used to rebuild an absolute link from the relative link. So for example, if the last page called was '/dir1/dir2/index.html' and the relative link is 'articles/index.html' then the resulting absolute link returned would be '/dir1/dir2/articles/index.html.'
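The relative-link rule in the last bullet can be sketched in a few lines. resolveInternalLink() is a hypothetical stand-alone helper written for this example, not the method in the class, and it makes no attempt to handle '../' segments.

```php
<?php
// Hypothetical helper mirroring the relative-link rule described for getLinksInternal().
function resolveInternalLink($lastUrl, $link)
{
    if (substr($link, 0, 1) == '/') {
        return $link; // already absolute on this host
    }
    // Directory of the last requested URL: everything up to the final slash.
    $slash = strrpos($lastUrl, '/');
    $dir = ($slash === false) ? '/' : substr($lastUrl, 0, $slash + 1);
    return $dir . $link;
}
```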



Here is the current class code:

<?php

class webRequest2
{
private $socket;

protected $finalURL;
protected $rawContent;
protected $rawHeader;
protected $rawResponse;

protected $chunkedLength;
protected $chunkedTransfer;
protected $cookies;
protected $cookieStr;
protected $errorFlag;
protected $getList;
protected $headers;
protected $postList;
protected $postStr;

public $accept;
public $charSet;
public $domain;
public $debugLogFile;
public $debugLogClearOnDispatch;
public $debugMode;
public $language;
public $manualPostContent;
public $method;
public $port;
public $redirect;
public $resultCode;
public $timeout;
public $url;
public $userAgent;
public $useSSL;

// Protected and special functions
function webRequest2()
{
$this->reset();
preg_match('/^([0-9])/', phpversion(), $parts);
$this->ancient = ($parts[1] < '5');
$this->userAgent = 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8';
$this->accept = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$this->charSet = 'ISO-8859-1,utf-8:q=0.7,*;q=0.7';
$this->language = 'en-us,en;q=0.5';

if (!defined('WRD_OFF'))
{
define('WRD_OFF', 0);
define('WRD_ECHO', 1);
define('WRD_LOG', 2);
}

$this->debugMode = WRD_OFF;
$this->debugLogFile = '';
$this->debugLogClearOnDispatch = true;
$this->timeout = 30;
$this->useSSL = false;
}

protected function buildCookieStr()
{
$cookieStr = '';
$start = true;
foreach($this->cookies as $name=>$value)
{
if (!$start) { $cookieStr .= '; '; }
$cookieStr .= "$name=$value";
$start = false;
}
$this->debug("Built COOKIE String: $cookieStr");
return $cookieStr;
}

protected function buildGetStr()
{
$getStr = '';
$getCount = count($this->getList);
if ($getCount)
{
$sepStr = '?';
foreach($this->getList as $name=>$value)
{
$value = urlencode($value);
$getStr .= "$sepStr$name=$value";
$sepStr = '&';
}
}
$this->debug("Built GET String: $getStr");
return $getStr;
}

protected function buildPostStr()
{
if ($this->manualPostContent)
return $this->manualPostContent;

$postStr = '';
$postCount = count($this->postList);
if ($postCount)
{
$sepStr = '';
foreach($this->postList as $name=>$value)
{
$value = urlencode($value);
$postStr .= "$sepStr$name=$value";
$sepStr = '&';
}
} else {
$postStr = 'No Content';
}
$this->debug("Built POST String: $postStr");
return $postStr;
}

protected function buildHeader()
{
// Each header line ends with CRLF and an empty line closes the header block
if ($this->method == 'GET')
{
$header = "GET {$this->finalURL} HTTP/1.1\r\n";
$header .= "Host: {$this->domain}\r\n";
$header .= "User-Agent: {$this->userAgent}\r\n";
$header .= "Accept: {$this->accept}\r\n";
$header .= "Accept-Language: {$this->language}\r\n";
$header .= "Accept-Encoding: \r\n";
$header .= "Accept-Charset: {$this->charSet}\r\n";
if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
$header .= "Connection: close\r\n";
$header .= "Content-Type: text/html\r\n";
$header .= "Content-Length: 0\r\n";
$header .= "\r\n";
} else {
$postData = $this->buildPostStr();
$requestLen = strlen($postData);
$header = "POST {$this->finalURL} HTTP/1.1\r\n";
$header .= "Host: {$this->domain}\r\n";
$header .= "User-Agent: {$this->userAgent}\r\n";
$header .= "Accept: {$this->accept}\r\n";
$header .= "Accept-Language: {$this->language}\r\n";
$header .= "Accept-Encoding: \r\n";
$header .= "Accept-Charset: {$this->charSet}\r\n";
// The braces are required here, otherwise the method call is not interpolated
if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
$header .= "Connection: close\r\n";
$header .= "Content-Type: application/x-www-form-urlencoded\r\n";
$header .= "Content-Length: $requestLen\r\n";
$header .= "\r\n";
// The post content follows the empty line, exactly $requestLen bytes long
$header .= $postData;
}
$this->debug("Outbound Header:\n$header");
return $header;
}

protected function buildURL()
{
$this->finalURL = "{$this->url}{$this->buildGetStr()}";
$this->debug("FinalURL: {$this->finalURL}");
}

protected function clearDebugLog()
{
if ($this->debugMode == WRD_LOG)
{
if (!$this->debugLogFile)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
if (file_put_contents($this->debugLogFile, '') === false)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
}
}

protected function debug($msg)
{
switch($this->debugMode)
{
case WRD_OFF:
return;

case WRD_ECHO:
echo "$msg\n";
break;

case WRD_LOG:
if (!$this->debugLogFile)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
if (file_put_contents($this->debugLogFile, "$msg\n", FILE_APPEND) === false)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
}
}

protected function execute($theHeader)
{
$this->beforeExecute();

$this->debug('Execute: Starts');

$this->clearHeaders();
$this->rawResponse = '';
$this->rawHeader = '';
$this->rawContent = '';
$this->chunkedTransfer = false;
$this->chunkedLength = 0;
$this->errorFlag = false;
$this->resultCode = 0;
$this->redirect = '';

$sslStr = ($this->useSSL) ? 'ssl://' : '';
$this->socket = @fsockopen("$sslStr{$this->domain}", $this->port, $errno, $errstr);
if (!$this->socket)
{
$this->debug('Execute: Cannot open socket');
$this->onFailure();
$this->afterExecute();
return false;
}

$this->debug('Execute: Sending request');
// Send the request untrimmed: trim() would strip the empty line that ends the header
$bytesToSend = strlen($theHeader);
if (($bytesSent = fwrite($this->socket, $theHeader)) <> $bytesToSend)
{
$this->debug("Execute: Failed - Only $bytesSent of $bytesToSend sent");
$this->onFailure();
$this->afterExecute();
return false;
}
$this->debug('Execute: Request Sent');

$this->rawResponse = $this->getChunk();

// The header and the content are separated by the first empty line
preg_match('/^(.*)\r\n\r\n/smU', $this->rawResponse, $parts);
$this->rawHeader = isset($parts[1]) ? $parts[1] : '';

preg_match('/\r\n\r\n(.*)/sm', $this->rawResponse, $parts);
$this->rawContent = isset($parts[1]) ? $parts[1] : '';

$this->processHeaders();
$this->reactToResultCode();

if ($this->chunkedTransfer)
{
$receivedSoFar = strlen($this->rawContent);
while ($receivedSoFar < $this->chunkedLength)
{
$this->debug('Execute: Getting Chunked Block');
$this->rawContent .= $this->getChunk();
$receivedSoFar = strlen($this->rawContent);

if ($this->errorFlag)
{
$this->debug('Execute: Terminating');
$this->onFailure();
$this->afterExecute();
return false;
}
}
}

$this->debug('Execute: Successful Retrieve');
$this->debug('postProcess: Content length is ' . strlen($this->rawContent));
$this->onSuccess();
$this->afterExecute();

$this->debug('Execute: Completes');
return $this->resultCode;

}

protected function getChunk()
{
$packets = array();
$this->debug('GetChunk: Starts');
while (!feof($this->socket))
{
if (!$this->ancient) { stream_set_timeout($this->socket, $this->timeout); }

$thisBuff = fread($this->socket, 65535);

if (!$this->ancient)
{
$info = stream_get_meta_data($this->socket);
if ($info['timed_out'])
{
$this->debug('getChunk: Timed Out');
$this->errorFlag = true;
break;
}
}

$this->debug('GetChunk: Received ' . strlen($thisBuff));
$packets[] = $thisBuff;
}
return implode('', $packets);
}

protected function hasCookies() { return count($this->cookies); }

protected function processHeaders()
{
$this->debug('processHeaders: Starts');
$tempArr = explode("\r\n", $this->rawHeader);
$ptr = 0;
foreach($tempArr as $line)
{
if ($ptr == 0)
{
// Zeroth line - not a valid header, get the result code:
preg_match('/([0-9]{3})/', $line, $parts);
$this->resultCode = isset($parts[1]) ? (int) $parts[1] : 0;
$ptr++;
$this->debug("processHeaders: Result Code is {$this->resultCode}");
} else {
// Limit of 2 so that a value containing ': ' stays in one piece
$parts = explode(': ', $line, 2);
if (count($parts) == 2) { $this->headers[$parts[0]] = $parts[1]; }
}
}

$this->debug("processHeaders: Array Follows\n" . print_r($this->headers, true));

$this->chunkedTransfer = preg_match('/: chunked/i', $this->rawHeader);
if ($this->chunkedTransfer)
{
// OK - the actual content length is now going to be the first line of the content... grab it and ditch it...
if (preg_match('/([^\r\n]+)\r\n(.*)/ms', $this->rawContent, $parts))
{
$this->chunkedLength = hexdec($parts[1]);
$this->rawContent = $parts[2];
}
$this->debug("processHeaders: Chunked Transfer - expected length is {$this->chunkedLength}");
}

// If there are cookies, pull them into my cookie array...
if (!empty($this->headers['Set-Cookie']))
{
$temp = explode(';', $this->headers['Set-Cookie']);
foreach($temp as $line)
{
// Limit of 2 so that a cookie value containing '=' is not cut short
$parts = explode('=', $line, 2);
if (count($parts) < 2) { continue; }
$name = trim($parts[0]);
$value = urldecode(trim($parts[1]));
$this->cookies[$name] = $value;
}
$this->debug("processHeaders: Cookie Array Follows\n" . print_r($this->cookies, true));
}
}

protected function reactToResultCode()
{
// This function should be extended in the future to handle more eventualities
switch($this->resultCode)
{
case 200:
break;

case 301:
case 302:
$this->redirect = $this->headers['Location'];
$this->debug("reactToResultCode: Redirect To {$this->redirect}");
break;

case 404:
break;

}
}



// Protected functions, designed to be overridden:
protected function afterExecute()
{
$this->debug('Default afterExecute()');
// Remember to call this function if you override it...
if ($this->socket)
fclose($this->socket);
}

protected function beforeExecute()
{
$this->debug('Default beforeExecute()');
}

protected function onFailure()
{
$this->debug('Default onFailure()');
}

protected function onSuccess()
{
$this->debug('Default onSuccess()');
}


// Public functions
function addGetParam($varName, $varValue)
{
$this->getList[trim($varName)] = $varValue;
$this->debug("Adding GET Param: [$varName] = [$varValue]");
}

function addPostParam($varName, $varValue)
{
$this->postList[trim($varName)] = $varValue;
$this->debug("Adding POST Param: [$varName] = [$varValue]");
}

function clearCookies() { $this->cookies = array(); }

function clearGetParams() { $this->getList = array(); }

function clearHeaders() { $this->headers = array(); }

function clearPostParams() { $this->postList = array(); }

function dispatch()
{
if ($this->debugLogClearOnDispatch) { $this->clearDebugLog(); }

$this->debug('Dispatch Starts');
$this->debug("Method: {$this->method}");

$this->buildURL();
$req = $this->buildHeader();
return $this->execute($req);
}

function getContent() { return $this->rawContent; }

function getCookie($cookieName) { return $this->cookies[$cookieName]; }

function getCookies() { return $this->cookies; }

function getHeader($headerName) { return $this->headers[$headerName]; }

function getHeaders() { return $this->headers; }

function getLinksAll()
{
// Capture the href value of every anchor tag, single or double quoted
$regex = <<<REGEX
~<[\s]*a[\s]+href[\s]*=[\s]*['"]([^'"]*)~i
REGEX;
preg_match_all($regex, $this->getContent(), $matches);
return $matches[1];
}

function getLinksExternal()
{
$out = array();
$temp = $this->getLinksAll();
foreach($temp as $link)
{
// If a link has http://{this->domain} in it or NO domain, then it is local...
if (!preg_match("~^https?://{$this->domain}~", $link) and (preg_match('/^http/', $link)))
$out[] = $link;
}
return $out;
}

function getLinksInternal()
{
$out = array();
$temp = $this->getLinksAll();
foreach($temp as $link)
{
// If a link has http://{this->domain} in it or NO domain, then it is local...
if (preg_match("~^https?://{$this->domain}~", $link) or (!preg_match('/^http/', $link)))
{
// OK - it is an internal link, but let's do a little bit to it to help out the caller...

// Kill simple on-page positioning links before anything gets added to the front of them...
if (substr($link, 0, 1) == '#')
continue;

// If the URL has http:// and no port, then kill the front end of it:
if (substr(strtolower($link), 0, 5) == 'http:')
{
if (!preg_match('~:[0-9]{1,6}/~', $link))
{
preg_match("~{$this->domain}(.*)~", $link, $matches);
$link = ($matches[1] == '') ? '/' : $matches[1];
}
}

// If it is a relative link, then we need to take the current directory and add it
// on to the front end so that the link is absolute. Links that still start with
// http at this point (secure, or with a port number) are left unchanged:
if ((substr($link, 0, 1) <> '/') and !preg_match('/^http/i', $link))
{
// Get the current directory from the last requested url (minus any get parameters)...
$path = preg_replace('~\?.*$~', '', $this->url);
if (preg_match('~^(/.*/)[^/]*$~', $path, $matches))
{
$link = "{$matches[1]}$link";
} else {
// There was no directory - the last send was the root, and the URL is
// relative to the root...
$link = "/$link";
}
}

$out[] = $link;
}
}
return $out;
}

function getRawHeader() { return $this->rawHeader; }
function getRawResponse() { return $this->rawResponse; }

function reset()
{
$this->rawPostData = '';
$this->url = '';
$this->domain = '';
$this->port = 80;
$this->method = 'GET';
$this->getArray['__count'] = -1;
$this->postArray['__count'] = -1;
$this->rawResponse = '';
$this->rawHeader = '';
$this->rawContent = '';

$this->cookieStr = '';
$this->postStr = '';
$this->manualPostContent = '';

$this->clearCookies();
$this->clearGetParams();
$this->clearHeaders();
$this->clearPostParams();
}

function setCookie($cookieName, $cookieValue) { $this->cookies[$cookieName] = $cookieValue; }

function simpleGet($completeURL)
{
$this->debug("simpleGet: Starts with [$completeURL]");

if (substr(strtolower($completeURL), 0, 5) == 'https')
{
$this->debug("simpleGet: Using SSL");
$this->useSSL = true;
$this->port = 443;
preg_match('~//(.*)~', $completeURL, $parts);
$completeURL = $parts[1];
} else if (substr(strtolower($completeURL), 0, 4) == 'http') {
$this->useSSL = false;
$this->port = 80;
preg_match('~//(.*)~', $completeURL, $parts);
$completeURL = $parts[1];
}

preg_match('~([^/]+)(.*)~', $completeURL, $parts);
$this->domain = $parts[1];
$this->url = $parts[2];
$this->finalURL = $this->url;

$req = $this->buildHeader();
if ($this->execute($req)) { return $this->rawContent; }

$this->debug('simpleGet Failed');
return false;

}
}

?>
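One thing to watch in the auto-cookie handling above: processHeaders() splits the Set-Cookie value on ';' and stores every piece, so attributes such as expires, path and domain land in the cookies array next to the real cookie. A stricter parse might keep only the first name=value pair. The sketch below shows that idea; parseSetCookie() is a hypothetical helper and not part of the class.

```php
<?php
// Hypothetical helper: pull only the cookie itself out of a Set-Cookie value.
function parseSetCookie($headerValue)
{
    // The first "name=value" pair is the cookie; everything after it is attributes.
    $pieces = explode(';', $headerValue);
    // Limit of 2 because the value itself may contain "=" characters.
    $pair = explode('=', trim($pieces[0]), 2);
    if (count($pair) < 2 || $pair[0] === '') {
        return false; // not a usable cookie
    }
    return array('name' => $pair[0], 'value' => $pair[1]);
}
```

Run against the Google header from the debug log earlier in the thread, this would give a single cookie named PREF with the whole ID=...:TM=... string as its value.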

emonk

Wow mayne. That's some pretty code. Almost makes me want to move over to the OO darkside....

One suggestion that may or may not be appropriate would be a user agent randomizer to help prevent bannination on those long runs, but it might be better off this way as it's more generic.

A sneaky way to do it if you wanted to would be to use IE Agents that look to have been infected with a zillion kinds of spyware by sticking some random gibberish into the UA where bonzai buddy and FunWeb usually like to hang out.

nutballs

i think perk is just assuming this is a base class that handles the core functionalities of requesting a page from the tubes.

I know that I for one, intend to add in a whole bunch of extra stuff, like an agent randomizer for example. Since perk has already parameterized everything anyway, you can do this simply, by writing your own random agent generator, and then doing class->userAgent = randomagentmaker(); before you simpleGet or dispatch.
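A generator along those lines can be tiny. This is only a sketch: randomAgentMaker() is a made-up function, and apart from the class default the agent strings in its list are just plausible examples that you would replace with your own.

```php
<?php
// Hypothetical randomAgentMaker(); the agent strings are examples only.
function randomAgentMaker($agents = null)
{
    if ($agents === null) {
        $agents = array(
            'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
        );
    }
    // array_rand() returns a random key from the list.
    return $agents[array_rand($agents)];
}
```

Usage would then be $req->userAgent = randomAgentMaker(); before each simpleGet() or dispatch().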

DangerMouse

Some nice additions there!

All the help I'm getting from this place must be paying off as when I replicated the previous webRequest class using CURL instead i've taken a similar approach!

With regard to the splitting of internal and external links, I appreciate that it is unlikely, but is there the potential that the basehref tag may be set to an external domain, making relative links external? I've gone round in circles with this one when creating a little spider class recently, the use of regex here makes it look so easy  Applause - back to the drawing board for me.

DM

perkiset

Quote from nutballs:

i think perk is just assuming this is a base class that handles the core functionalities of requesting a page from the tubes.

Spot on NBs - the last class was rife with customized horsepoop rather than trim, fast and targeted. I think this class, particularly as a base, is more reusable and robust.


Quote from DangerMouse:

With regard to the splitting of internal and external links, I appreciate that is is unlikely, but is there the potential that the basehref tag may be set to an external domain, making relative links external?

Absolutely, but as you say, pretty durn unlikely. Have you actually seen this in the wild? I suppose that I could look for a basehref tag and then try to adjust accordingly, but that's pretty far out there for where I wanted to go with this class. My next thought would be a child class called "spider", which is where I would add a lot of that sort of thing. Frankly, the getLinks functions are already sort of razor's-edge to me, being somewhat outside the scope of a pure page requestor. From a hierarchy standpoint, I'd see myself creating different trees from this class, one branching towards spidering and the other branching towards scripted interaction with other sites.
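For what it's worth, looking for the tag is the easy part. Here is a hedged sketch of how a spider child class might detect it; findBaseHref() is a hypothetical helper, and it only finds the first quoted href on a <base> tag.

```php
<?php
// Hypothetical helper: return the href of the first <base> tag, or '' if there is none.
function findBaseHref($html)
{
    // \b after "base" keeps tags like <basefont> from matching.
    if (preg_match('~<base\b[^>]*\bhref\s*=\s*[\'"]([^\'"]*)[\'"]~i', $html, $m)) {
        return $m[1];
    }
    return '';
}
```

If the value comes back pointing at another domain, relative links on that page would have to be treated as external instead of being rebuilt against the requested URL.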

dimitry12

Could you please share your intentions what you will use WebRequest class and its descendants to? (Presuming that content is not the king Applause )

DangerMouse

Quote from perkiset:

Have you actually seen this in the wild?


Nah havent seen it anywhere, not really experienced enough to judge though.

Love the sound of where you're going with this - keeps me inspired to continue chipping away at my own little projects.

DM

perkiset

Quote from dimitry12:

Could you please share your intentions what you will use WebRequest class and its descendants to? (Presuming that content is not the king Applause )


Well, my first intentions are simply to shore up existing processes that have become sluggish and/or dusty and broken. Primarily my spider processes and some scripted 'bots. But I also do a lot of B2B work and the method of communication is becoming less socket-level proprietary stuff and more webrequestish - so it only makes sense to refactor old code and bring it up to date.

As to what I'm spidering, or what I'm scripting, I'd prefer not to say... Applause

perkiset

<>Update
I just checked the code for POST and there was a tiny bug which I have fixed in the code a few posts up from here. If you copied the code earlier than the time stamp on this post then you'll want to grab it again, or you can patch lne 137, which used to be:

$header .= "Host: {$this->Host}\r\n";

... and it needs to be...

$header .= "Host: {$this->domain}\r\n";

Other than that, POSTing seems to be fine.
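For anyone patching by hand, here is a minimal standalone sketch of what that line contributes to the outgoing request. The function name `sketchGetRequest` and the user agent string are made up for illustration and are not part of webRequest2; the point is only that the Host header carries the domain value and that header lines are joined with CRLF, with a blank line closing the header section.

```php
<?php
// Hypothetical helper for illustration only; webRequest2 builds its
// header internally in buildHeader().
function sketchGetRequest($domain, $path, $userAgent)
{
    $lines = array(
        "GET {$path} HTTP/1.1",
        "Host: {$domain}",            // the value the patched line takes from $this->domain
        "User-Agent: {$userAgent}",
        "Connection: close",
    );

    // HTTP separates header lines with CRLF; an empty line ends the header section.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = sketchGetRequest('www.perkiset.org', '/', 'sketch-agent/0.1');
echo $request;
```

If the Host line is built from a property that does not exist, the header goes out as "Host: " with no value, which many virtual-hosted servers will reject or route to the wrong site.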

/p

perkiset

Yet Another Update
This update should not break any of your code using the class.

  • Added support for using proxies. Simply put an address in the $class->proxy property and the class will route requests through the proxy and update the header appropriately. Note that this is still in BETA - I think there'll be a need for a retries property or something, but am looking for feedback. Remember to set the $class->port property if the proxy is not on 80.

  • Experimental, need assistance here: Added preliminary support for multipart form submission. Currently it supports only text/plain type fields - but with a little tweaking it should support all forms of MIME uploadable data. The optional 3rd parameter on addPostParam is where the type goes - so you can optionally say addPostParam('varname', 'varvalue', '[a content type]') - my intention is to correctly set up the disposition and encoding based on the type you specify. VSloathe, if you're reading this, I can use your testing...

  • Rebuilt how the header is created and a couple other functions to be quicker and more robust for more coming modifications.
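To make the first two points concrete, here is a rough standalone sketch of what goes on the wire. Both function names, the boundary string and the /form.php path are invented for the example and are not part of webRequest2; it only assumes the usual HTTP conventions: a request sent through a proxy names the target as an absolute URL, and a multipart/form-data body wraps each field in a part that starts with "--boundary" and closes with "--boundary--".

```php
<?php
// Hypothetical helpers for illustration only; not part of webRequest2.

// Through a proxy the request line carries an absolute URL;
// a direct request carries only the path.
function sketchRequestLine($method, $domain, $path, $viaProxy)
{
    $target = $viaProxy ? "http://{$domain}{$path}" : $path;
    return strtoupper($method) . " {$target} HTTP/1.1";
}

// Each field becomes one part; lines are separated by CRLF.
// $fields uses the same shape as the class's post list:
// name => array('content' => ..., 'type' => ...)
function sketchMultipartBody(array $fields, $boundary)
{
    $lines = array();
    foreach ($fields as $name => $field) {
        $lines[] = "--{$boundary}";
        $lines[] = "Content-Disposition: form-data; name=\"{$name}\"";
        $lines[] = "Content-Type: {$field['type']}";
        $lines[] = '';                      // blank line between part headers and part content
        $lines[] = $field['content'];
    }
    $lines[] = "--{$boundary}--";           // closing delimiter
    $lines[] = '';
    return implode("\r\n", $lines);
}

$fields = array(
    'varname' => array('content' => 'varvalue', 'type' => 'text/plain'),
);
$body = sketchMultipartBody($fields, 'sketchBoundary123');

echo sketchRequestLine('post', 'www.perkiset.org', '/form.php', true), "\r\n";
echo "Content-Type: multipart/form-data; boundary=sketchBoundary123\r\n";
echo 'Content-Length: ', strlen($body), "\r\n\r\n";
echo $body;
```

Binary uploads would additionally need a filename in the disposition line and possibly a transfer encoding, which this sketch deliberately leaves out.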





<?php

 

class webRequest2
{
private $socket;

protected $finalURL;
protected $rawContent;
protected $rawHeader;
protected $rawResponse;

protected $chunkedLength;
protected $chunkedTransfer;
protected $cookies;
protected $cookieStr;
protected $errorFlag;
protected $getList;
protected $headers;
protected $postList;
protected $postStr;

public $accept;
public $charSet;
public $domain;
public $debugLogFile;
public $debugLogClearOnDispatch;
public $debugMode;
public $language;
public $manualPostContent;
public $method;
public $port;
public $postMode;
public $proxy;
public $redirect;
public $resultCode;
public $timeout;
public $url;
public $userAgent;
public $useSSL;

// Protected and special functions
function webRequest2()
{
$this->reset();
preg_match('/^([0-9])/', phpversion(), $parts);
$this->ancient = ($parts[1] < '5');
$this->userAgent = 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8';
$this->accept = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$this->charSet = 'ISO-8859-1,utf-8:q=0.7,*;q=0.7';
$this->language = 'en-us,en;q=0.5';

if (!defined('WRD_OFF'))
{
define('WRD_OFF', 0);
define('WRD_ECHO', 1);
define('WRD_LOG', 2);

define('WRM_GET', 0);
define('WRM_POST', 1);

define('WRP_NORMAL', 0);
define('WRP_MULTIPART', 1);
}

$this->debugMode = WRD_OFF;
$this->postMode = WRP_NORMAL;
$this->debugLogFile = '';
$this->debugLogClearOnDispatch = true;
$this->timeout = 30;
$this->useSSL = false;
$this->proxy = '';
}

protected function buildCookieStr()
{
$cookieStr = '';
$start = true;
foreach($this->cookies as $name=>$value)
{
if (!$start) { $cookieStr .= '; '; }
$cookieStr .= "$name=$value";
$start = false;
}
$this->debug("Built COOKIE String: $cookieStr");
return $cookieStr;
}

protected function buildGetStr()
{
$getStr = '';
$getCount = count($this->getList);
if ($getCount)
{
$sepStr = '?';
foreach($this->getList as $name=>$value)
{
$value = urlencode($value);
$getStr .= "$sepStr$name=$value";
$sepStr = '&';
}
}
$this->debug("Built GET String: $getStr");
return $getStr;
}

protected function buildPostStr()
{
if ($this->manualPostContent)
return $this->manualPostContent;

$postStr = '';
$postCount = count($this->postList);
if ($postCount)
{
$sepStr = '';
foreach($this->postList as $name=>$arr)
{
$value = urlencode($arr['content']);
$postStr .= "$sepStr$name=$value";
$sepStr = '&';
}
} else {
$postStr = 'No Content';
}
$this->debug("Built POST String: $postStr");
return $postStr;
}

protected function buildHeader()
{

$header[0] = ''; // place holder for first line of header
$header[] = "Host: {$this->domain}";
$header[] = "User-Agent: {$this->userAgent}";
$header[] = "Accept: {$this->accept}";
$header[] = "Accept-Language: {$this->language}";
$header[] = "Accept-Encoding: ";
$header[] = "Accept-Charset: {$this->charSet}";
if ($this->hasCookies()) { $header[] = "Cookie: {$this->buildCookieStr()}"; }
$header[] = "Connection: close";

$hostStr = ($this->proxy) ? "http://{$this->domain}" : '';
switch($this->method)
{
case 'get':
case 'GET':
$header[0] = "GET $hostStr{$this->finalURL} HTTP/1.1";
$header[] = '';
$header[] = "Content-Type: text/html";
$header[] = "Content-Length: 0";
$header[] = '';
break;

case 'post':
case 'POST':
if (count($this->postList) == 0) $this->postMode = WRP_NORMAL;

$header[0] = "POST $hostStr{$this->finalURL} HTTP/1.1";
switch ($this->postMode)
{
case WRP_NORMAL:
$postData = $this->buildPostStr();
$requestLen = strlen($postData);
$header[] = "Content-Type: application/x-www-form-urlencoded";
$header[] = "Content-Length: $requestLen";
$header[] = '';
$header[] = $postData;
break;

case WRP_MULTIPART:
$boundary = time() . time();
$postData = $this->buildMultipartPostStr($boundary);
$requestLen = strlen($postData);
$header[] = "Content-Type: multipart/form-data; boundary=$boundary";
$header[] = "Content-Length: $requestLen";
$header[] = '';
$header[] = "$postData";
break;

default:
$this->debug("buildHeader: Terminal failure - unknown postMode '{$this->postMode}'");
throw new Exception("buildHeader: Terminal failure - unknown postMode '{$this->postMode}'");
break;
}
break;

default:
$this->debug("buildHeader: Terminal failure - unknown method '{$this->method}'");
throw new Exception("buildHeader: Terminal failure - unknown method '{$this->method}'");
break;
}

$out = implode("\r\n", $header);
$this->debug("Outbound Header: $out");
return $out;
}

protected function buildMultipartPostStr($boundary)
{
$out = array();
foreach($this->postList as $name=>$arr)
{
$value = $arr['content'];
$type = $arr['type'];
$out[] = "--$boundary";
$out[] = "Content-Disposition: form-data; name=\"$name\"";
$out[] = "Content-type: $type";
$out[] = '';
$out[] = "$value";
}

$out[] = "--$boundary--";

return implode("\r\n", $out);
}

protected function buildURL()
{
$this->finalURL = "{$this->url}{$this->buildGetStr()}";
$this->debug("FinalURL: {$this->finalURL}");
}

protected function clearDebugLog()
{
if ($this->debugMode == WRD_LOG)
{
if (!$this->debugLogFile)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
if (file_put_contents($this->debugLogFile, '') === false)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
}
}

protected function debug($msg)
{
switch($this->debugMode)
{
case WRD_OFF:
return;

case WRD_ECHO:
echo "$msg\n";
break;

case WRD_LOG:
if (!$this->debugLogFile)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
if (file_put_contents($this->debugLogFile, "$msg\n", FILE_APPEND) === false)
throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
}
}

protected function execute($theHeader)
{
$this->beforeExecute();

$this->debug('Execute: Starts');

$this->clearHeaders();
$this->rawResponse = '';
$this->rawHeader = '';
$this->rawContent = '';
$this->chunkedTransfer = false;
$this->chunkedLength = 0;
$this->errorFlag = false;
$this->resultCode = 0;
$this->redirect = '';

$sslStr = ($this->useSSL) ? 'ssl://' : '';
$hostStr = ($this->proxy) ? "$sslStr{$this->proxy}" : "$sslStr{$this->domain}";
$this->debug("Execute: HostStr=[{$hostStr}] Port:{$this->port}");
$this->socket = @fsockopen($hostStr, $this->port, $errno, $errstr);
if (!$this->socket)
{
$this->debug('Execute: Cannot open socket');
$this->onFailure();
$this->afterExecute();
return false;
}

$this->debug('Execute: Sending request');
$bytesToSend = strlen(trim($theHeader));
if (($bytesSent = fwrite($this->socket, trim($theHeader))) <> $bytesToSend)
{
$this->debug("Execute: Failed - Only $bytesSent of $bytesToSend sent");
$this->onFailure();
$this->afterExecute();
return false;
}
$this->debug('Execute: Request Sent');

$this->rawResponse = $this->getChunk();

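// The header is everything up to the first blank line...
preg_match('/^(.*)\r\n\r\n/smU', $this->rawResponse, $parts);
$this->rawHeader = $parts[1];

// ... and the content is everything after it.
preg_match('/\r\n\r\n(.*)/sm', $this->rawResponse, $parts);
$this->rawContent = $parts[1];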

$this->processHeaders();
$this->reactToResultCode();

if ($this->chunkedTransfer)
{
$receivedSoFar = strlen($this->rawContent);
while ($receivedSoFar < $this->chunkedLength)
{
$this->debug('Execute: Getting Chunked Block');
$this->rawContent .= $this->getChunk();
$receivedSoFar = strlen($this->rawContent);

if ($this->errorFlag)
{
$this->debug('Execute: Terminating');
$this->onFailure();
$this->afterExecute();
return false;
}
}
}

$this->debug('Execute: Successful Retrieve');
$this->debug('postProcess: Content length is ' . strlen($this->rawContent));
$this->onSuccess();
$this->afterExecute();

$this->debug('Execute: Completes');
return $this->resultCode;

}

protected function getChunk()
{
$packets = array();
$this->debug('GetChunk: Starts');
while (!feof($this->socket))
{
if (!$this->ancient) { stream_set_timeout($this->socket, $this->timeout); }

$thisBuff = fread($this->socket, 65535);

if (!$this->ancient)
{
$info = stream_get_meta_data($this->socket);
if ($info['timed_out'])
{
$this->debug('getChunk: Timed Out');
$this->errorFlag = true;
break;
}
}

$this->debug('GetChunk: Received ' . strlen($thisBuff));
$packets[] = $thisBuff;
}
return implode('', $packets);
}

protected function hasCookies() { return count($this->cookies); }

protected function processHeaders()
{
$this->debug('processHeaders: Starts');
$tempArr = explode("\r\n", $this->rawHeader);
$ptr = 0;
foreach($tempArr as $line)
{
if ($ptr == 0)
{
// Zeroth line - not a valid header, get the result code:
preg_match('/([0-9]{3})/', $line, $parts);
$this->resultCode = $parts[1];
$ptr++;
$this->debug("processHeaders: Result Code is {$this->resultCode}");
} else {
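// Split on the first ': ' only, so values that contain ': ' stay intact
$parts = explode(': ', $line, 2);
if (count($parts) == 2) { $this->headers[$parts[0]] = $parts[1]; }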
}
}

$this->debug("processHeaders: Array Follows " . print_r($this->headers, true));

$this->chunkedTransfer = preg_match('/: chunked/i', $this->rawHeader);
if ($this->chunkedTransfer)
{
// OK - the actual content length is now going to be the first line of the content... grab it and ditch it...
preg_match('/([^\r\n]+)\r\n(.*)/ms', $this->rawContent, $parts);
$this->chunkedLength = hexdec($parts[1]);
$this->rawContent = $parts[2];
$this->debug("processHeaders: Chunked Transfer - expected length is {$this->chunkedLength}");
}

// If there are cookies, pull them into my cookie array...
if (!empty($this->headers['Set-Cookie']))
{
$temp = explode(';', $this->headers['Set-Cookie']);
foreach($temp as $line)
{
$parts = explode('=', $line);
$name = trim($parts[0]);
$value = urldecode(trim($parts[1]));
$this->cookies[$name] = $value;
}
$this->debug("processHeaders: Cookie Array Follows " . print_r($this->cookies, true));
}
}

protected function reactToResultCode()
{
// This function should be extended in the future to handle more eventualities
switch($this->resultCode)
{
case 200:
break;

case 301:
case 302:
$this->redirect = $this->headers['Location'];
$this->debug("reactToResultCode: Redirect To {$this->redirect}");
break;

case 404:
break;

}
}



// Protected functions, designed to be overridden:
protected function afterExecute()
{
$this->debug('Default afterExecute()');
// Remember to call this function if you override it...
if ($this->socket)
fclose($this->socket);
}

protected function beforeExecute()
{
$this->debug('Default beforeExecute()');
}

protected function onFailure()
{
$this->debug('Default onFailure()');
}

protected function onSuccess()
{
$this->debug('Default onSuccess()');
}


// Public functions
function addGetParam($varName, $varValue)
{
$this->getList[trim($varName)] = $varValue;
$this->debug("Adding GET Param: [$varName] = [$varValue]");
}

function addPostParam($varName, $varValue, $type='text/plain')
{
$varName = trim($varName);
$this->postList[$varName]['content'] = $varValue;
$this->postList[$varName]['type'] = $type;
$this->debug("Adding POST Param: [$varName] = [$varValue]");
}

function clearCookies() { $this->cookies = array(); }

function clearGetParams() { $this->getList = array(); }

function clearHeaders() { $this->headers = array(); }

function clearPostParams() { $this->postList = array(); }

function dispatch()
{
if ($this->debugLogClearOnDispatch) { $this->clearDebugLog(); }

$this->debug('Dispatch Starts');
$this->debug("Method: {$this->method}");

$this->buildURL();
$req = $this->buildHeader();
return $this->execute($req);
}

function getContent() { return $this->rawContent; }

function getCookie($cookieName) { return $this->cookies[$cookieName]; }

function getCookies() { return $this->cookies; }

function getHeader($headerName) { return $this->headers[$headerName]; }

function getHeaders() { return $this->headers; }

function getLinksAll()
{
$regex = <<<REGEX
~<\s*a\s+href\s*=\s*['"]([^'"]*)~i
REGEX;
preg_match_all($regex, $this->getContent(), $matches);
return $matches[1];
}

function getLinksExternal()
{
$out = array();
$temp = $this->getLinksAll();
foreach($temp as $link)
{
// If a link is absolute (starts with http) and does not point at {this->domain}, then it is external...
if (!preg_match("~^https?://{$this->domain}~", $link) and (preg_match('/^http/', $link)))
$out[] = $link;
}
return $out;
}

function getLinksInternal()
{
$out = array();
$temp = $this->getLinksAll();
foreach($temp as $link)
{
// If a link has http://{this->domain} in it or NO domain, then it is local...
if (preg_match("~^https?://{$this->domain}~", $link) or (!preg_match('/^http/', $link)))
{
// OK - it is an internal link, but let's do a little bit to it to help out the caller...

// If the URL has http:// and no port, then kill the front end of it:
if (substr(strtolower($link), 0, 5) == 'http:')
{
if (!preg_match('~:[0-9]{1,6}/~', $link))
{
preg_match("~{$this->domain}(.*)~", $link, $matches);
$link = $matches[1];
}
}

// Kill simple on-page positioning links (checked first, before the link is made absolute)...
if (substr($link, 0, 1) == '#')
continue;

// If it is a relative link, then we need to take the current directory and add it
// on to the front end so that the link is absolute:
if (substr($link, 0, 1) <> '/')
{
// Get the current directory from the original url...
if (preg_match('~^(/.*/)[^/]*$~', $this->url, $matches))
{
$link = "{$matches[1]}$link";
} else {
// There was no directory - the last send was the root, and the URL is
// relative to the root...
$link = "/$link";
}
}

$out[] = $link;
}
}
return $out;
}

function getRawHeader() { return $this->rawHeader; }
function getRawResponse() { return $this->rawResponse; }

function reset()
{
$this->rawPostData = '';
