The Cache: Technology Expert's Forum
 
Author Topic: Perk's NEW WebRequest Class  (Read 27860 times)
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« on: November 12, 2007, 02:56:24 PM »

I’ve given my old code a bit of a look and noticed how klunky it is. I have also been considering the “pipelining” thread, as well as error handling and the funkier packets returned from servers. With that, I have a new web request class, cleverly named “webRequest2.” It allows for user agent spoofing as well as different encodings and languages. It is also more version-resilient: it can run in PHP4 and earlier, but takes advantage of the timeout feature of PHP5 as well. NBs, I’ve also added a simpleGet function for you.

Additionally, the old class had a lot of crap in it for old faulty servers I was dealing with - success on timeouts, considering a page completed if I ever saw the "</body>" tag ... stuff like that. I think that these things would be better placed in a derivative class than in the fundamental class as I had them - so this one is much leaner from that perspective.

I have not implemented pipelining, nor looked at VSloathe's notion of multipart packets, but the way it is built these should be minor mods.

Properties
  • accept - defaults to text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
  • charSet - defaults to ISO-8859-1,utf-8:q=0.7,*;q=0.7
  • domain - the host of where you want to send requests ie., www.perkiset.org
  • debugLogFile - Only required if you set the debug mode to WRD_LOG – this is the name of the file that will be used for logging.
  • debugLogClearOnDispatch - If debug logging, this will clear the file on every new dispatch.
  • debugMode - Defaults to WRD_OFF, if WRD_ECHO then the debug lines are spit out as they are created and if WRD_LOG then debug lines are written quietly to the debugLogFile.
  • language - Defaults to en-us,en;q=0.5
  • manualPostContent - If you want to post something and do not want the class to construct the post data for you (ie., a custom job or special data) then put it here and the class will use it instead of building a new one at dispatch.
  • method - GET or POST – currently no checking here, watch yourself! Defaults to GET.
  • port - Defaults to 80
  • timeout - Defaults to 30 seconds. Only useful if you have PHP5 – this will allow the stream to timeout and return a failure if the amount of time you have allotted for download has been exceeded.
  • url - The simple URL (no get parameters) of where you want to go ie., /index.html
  • userAgent - Defaults to Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8

Methods
  • addGetParam($varName, $varValue) – add a parameter that will be appended to the URL as a get parameter
  • addPostParam($varName, $varValue) – add a value that will be sent up in the POST CONTENT portion of the request
  • clearCookies() – clear the cookies array
  • clearGetParams() – clear the get parameters array
  • clearPostParams() – clear the post parameters array
  • dispatch() – execute the request and return T/F if successful
  • getContent() – return the content portion of the last request
  • getCookie($varName) – return the value of cookie ($varName)
  • getCookies() – return a handle to the cookies array
  • getHeader($varName) – return the value of header ($varName)
  • getHeaders() – return a handle to the headers array
  • getRawHeader() – return the raw header text
  • getRawResponse() – return the entire unparsed response
  • reset() – reset all internal variables EXCEPT for accept, charSet, language and userAgent.
  • setCookie($varName, $varValue) – set cookie(varName) to varValue
  • simpleGet($completeURL) – Pass a complete URL like http://myDomain.com/search.php?str=viagra and I’ll use that without any of the processing normally handled in dispatch().

Usage
Code:
<?php

$req = new webRequest2();
$content = $req->simpleGet('http://www.perkiset.org/');

$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$content = $req->simpleGet('http://www.braindonkey.com/2007/11/10/the-road-record-cheated-out-of-my-millions/');
print_r($req->getHeaders());

$req = new webRequest2();
$req->domain = 'blogs.pcworld.com';
$req->url = '/staffblog/archives/005885.html';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
$req->dispatch();
echo $req->getRawResponse();

?>


Here is an example of debug output for the blog site that NBs used as an example in the old thread:
Code:
Dispatch Starts
Method: GET
Built GET String:
FinalURL: http://blogs.pcworld.com/staffblog/archives/005885.html
Outbound Header:
GET http://blogs.pcworld.com/staffblog/archives/005885.html HTTP/1.1
Host: blogs.pcworld.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close

Content-Type: text/html
Content-Length: 0

Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 1471
GetChunk: Received 2778
GetChunk: Received 627
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1368
GetChunk: Received 1368
GetChunk: Received 2736
GetChunk: Received 2736
GetChunk: Received 1968
GetChunk: Received 0
processHeaders: Starts
processHeaders: Array Follows
Array
(
    [HTTP/1.1 200 OK] =>
    [Date] => Mon, 12 Nov 2007 23:31:21 GMT
    [Server] => Apache/1.3.27 (Unix)
    [Connection] => close
    [Content-Type] => text/html
    [Vary] => Accept-Encoding
)

Execute: Successful Retrieve
postProcess: Content length is 40891
Default onSuccess()
Default afterExecute()
Execute: Completes


I have not yet had time to work through a solid testing suite for POST data. Also, I was using your server NBs for chunked testing and it spontaneously started sending the entire packet in one chunk rather than forcing me into many, so that symptom may rear its ugly head yet again. This class should be much better at handling packets like that than my previous class tho.
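For anyone following the chunked discussion: a chunked body is a series of blocks, each prefixed by its length in hex on its own line, and it ends with a zero-length block. Below is a minimal sketch of decoding one, independent of the class. The function name decodeChunkedBody is made up for illustration and is not part of webRequest2, and it assumes the whole body has already been read into a string.

```php
<?php
// Hypothetical helper, not part of webRequest2: decode an HTTP/1.1 chunked body
// that has already been read completely into $raw.
function decodeChunkedBody($raw)
{
    $out = '';
    $pos = 0;
    $len = strlen($raw);
    while ($pos < $len) {
        // Each chunk starts with "<hex size>[;extensions]\r\n"
        $eol = strpos($raw, "\r\n", $pos);
        if ($eol === false) { break; }              // malformed or truncated input
        $sizeParts = explode(';', substr($raw, $pos, $eol - $pos));
        $size = hexdec(trim($sizeParts[0]));
        if ($size == 0) { break; }                  // zero-size chunk ends the body
        $out .= substr($raw, $eol + 2, $size);      // the chunk data itself
        $pos = $eol + 2 + $size + 2;                // skip the data plus its trailing CRLF
    }
    return $out;
}
```

Feeding it "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" gives back "Wikipedia". Trailer headers after the final chunk are simply ignored in this sketch.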

Feedback and bugs welcome, thanks!

/p

Code:
<?php

// CODE HAS BEEN MOVED TO NEXT POST WITH UPDATES.

?>

« Last Edit: November 12, 2007, 04:32:19 PM by perkiset » Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
perkiset
« Reply #1 on: November 12, 2007, 04:06:07 PM »

First Update: NBs made a great request to add some events to execute() - so there are 4 functions that you can override in a descendant class that might make sense for you:

  • afterExecute() - this function is called at the very end of execution, regardless of success or failure.
  • beforeExecute() - this function will be called at the very start of execute(), which is AFTER url preparation and such in the dispatch call.
  • onFailure() - this function is called in the event of a failed request (execute) after all default error processing is handled but before afterExecute()
  • onSuccess() - this function is called in the event of a successful request (execute) after everything has been completed but before afterExecute().
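The hook order described above can be sketched like this. To keep the snippet runnable by itself it uses a small stand-in base class (hookOrderStandIn, a made-up name) that only mimics the call order described in the list; with the real class you would extend webRequest2 and override the same four method names.

```php
<?php
// Stand-in for webRequest2 (NOT the real class). It only mimics the order in which
// execute() is described as calling the four hooks, so this snippet runs by itself.
class hookOrderStandIn
{
    protected function beforeExecute() {}
    protected function onSuccess() {}
    protected function onFailure() {}
    protected function afterExecute() {}

    public function run($succeeded)
    {
        $this->beforeExecute();
        if ($succeeded) { $this->onSuccess(); } else { $this->onFailure(); }
        $this->afterExecute();
        return $succeeded;
    }
}

// A descendant that records which hooks fired.
class loggingRequest extends hookOrderStandIn
{
    public $events = array();

    protected function beforeExecute() { $this->events[] = 'before'; }
    protected function onSuccess()     { $this->events[] = 'success'; }
    protected function onFailure()     { $this->events[] = 'failure'; }
    protected function afterExecute()
    {
        parent::afterExecute();    // keep whatever cleanup the parent does
        $this->events[] = 'after';
    }
}

$req = new loggingRequest();
$req->run(false);
// $req->events is now array('before', 'failure', 'after')
```

Calling the parent from afterExecute() matters with the real class: the default afterExecute() in the code posted later in this thread is where the socket gets closed.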
Code:
<?php

// Code has been moved to lower post with updates

?>

« Last Edit: November 12, 2007, 06:03:37 PM by perkiset » Logged

nop_90
Global Moderator
Lifer
*****
Offline Offline

Posts: 2203


View Profile
« Reply #2 on: November 12, 2007, 04:37:45 PM »

very nice.
regardless of what language you use, good idea to study perks code so u can see how http protocol works.
(nothing like having live code to examine, rather than stupid specs)
i have a few ideas where perks code may become useful Smiley.

Logged
perkiset
« Reply #3 on: November 12, 2007, 05:08:32 PM »

Thanks nop -

Now I have a question for anyone reading along... WTF is with WordPress?

If I set the permalinks structure to anything except "default" then the URLs that I send into the server fail... if I leave it at default then I get what I expect. For example, if I request "http://www.perkiset.org/politics/?p=29" I get the correct page - if I request "http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone" with the permalinks on I get a 404. Of course, they both work perfectly in a browser. Additionally, if I have permalinks on and ask for the parameterized URL in Safari, it still works correctly, but then 404s in the class.

I've just requested a bunch of stuff from a boatload of sites and the class is holding up nicely - except for permalinked WordPress. Any ideas?
Logged

perkiset
« Reply #4 on: November 12, 2007, 06:02:56 PM »

Thanks NBs for working the problem with me -

As it happened, my header was slightly sloppy and virtually everything could work with it, except WordPress's translation routines, which did not understand it. Not only did I fix the header, but I also patched the simpleGet function to make sure that it is correct as well.
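The post doesn't say exactly what was sloppy. One visible difference between the debug logs in this thread is the request line: the first log sends the full URL (GET http://blogs.pcworld.com/... HTTP/1.1) while the later Google log sends only the path (GET / HTTP/1.1), so that may well have been the fix. As a rough sketch of a tidy HTTP/1.1 GET request (buildMinimalGet is a made-up helper, not a method of the class):

```php
<?php
// Hypothetical helper, not part of webRequest2: build a minimal HTTP/1.1 GET request.
function buildMinimalGet($host, $path, $userAgent)
{
    if ($path == '') { $path = '/'; }        // an empty path is requested as "/"
    $req  = "GET $path HTTP/1.1\r\n";        // path only, no scheme or host here
    $req .= "Host: $host\r\n";               // the host travels in its own header
    $req .= "User-Agent: $userAgent\r\n";
    $req .= "Connection: close\r\n";
    $req .= "\r\n";                          // a blank line ends the header block
    return $req;
}
```

A GET carries no body, so nothing follows the blank line.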

Code:
<?php

// Code moved yet again to another lower post

?>

« Last Edit: November 12, 2007, 06:32:27 PM by perkiset » Logged

nutballs
Administrator
Lifer
*****
Offline Offline

Posts: 5627


Back in my day we had 9 planets


View Profile
« Reply #5 on: November 12, 2007, 06:20:29 PM »

cool cool it worked.

I successfully ripped your blog Smiley

errr though you got some weird chars in there, new thread about that...
« Last Edit: November 12, 2007, 06:30:30 PM by nutballs » Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset
« Reply #6 on: November 12, 2007, 06:31:56 PM »

Another Update
  • Changed the result of dispatch() to either false if failed or the integer value of the response code ie., 200, 301, 404 etc.
  • Added 2 new properties, resultCode (which contains the same value as the return from dispatch) and redirect. If the response code is 301 or 302 then the redirect property is filled with the new URL. If you use simpleGet then you'll get a blank content, but you can check the resultCode to see what happened (example below)
  • Added a handler function to react to the response code which, at this time, simply fills a new property "redirect" with what the request should redirect to. This will certainly get larger as more folks weigh in here.
  • I will correctly handle if you simpleGet with http: in front or not.

Completed code follows the examples.

Code:
<?php

// This is an example of using simpleGet and the result code...
$req = new webRequest2();
if (!$buff = $req->simpleGet('http://www.perkiset.org/forum/'))
{
    switch($req->resultCode)
    {
        case 301:
        case 302:
            $newURL = $req->redirect;
            break;
    }
}

// This example will write a debug log and echo the result code of the get.
$req = new webRequest2();
$req->domain = 'www.perkiset.org';
$req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req->debugMode = WRD_LOG;
$req->debugLogFile = '/www/sites/testing/gettest.txt';
$req->debugLogClearOnDispatch = true;
echo $req->dispatch();

?>


Here is the debug log on a page that Google wants to redirect:
Code:
<?php
$req = new webRequest2();
$req->debugMode = WRD_ECHO;
$req->simpleGet('http://google.com/');
?>


simpleGet: Starts with [http://google.com/]
Outbound Header:
GET / HTTP/1.1
Host: google.com
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding:
Accept-Charset: ISO-8859-1,utf-8:q=0.7,*;q=0.7
Connection: close

Content-Type: text/html
Content-Length: 0

Default beforeExecute()
Execute: Starts
Execute: Sending request
Execute: Request Sent
GetChunk: Starts
GetChunk: Received 554
processHeaders: Starts
processHeaders: Result Code is 301
processHeaders: Array Follows
Array
(
    [Location] => http://www.google.com/
    [Set-Cookie] => PREF=ID=cbefc888fd113739:TM=1194917676:LM=1194917676:S=BEN0G7hklu18yy-o; expires=Thu, 12-Nov-2009 01:34:36 GMT; path=/; domain=.google.com
    [Content-Type] => text/html
    [Server] => gws
    [Content-Length] => 219
    [Date] => Tue, 13 Nov 2007 01:34:36 GMT
    [Connection] => Close
)

reactToResultCode: Redirect To http://www.google.com/
Execute: Successful Retrieve
postProcess: Content length is 219
Default onSuccess()
Default afterExecute()
Execute: Completes

Here is the latest class:
Code:
<?php

// Code moved again to lower post

?>

« Last Edit: November 13, 2007, 12:03:59 PM by perkiset » Logged

perkiset
« Reply #7 on: November 12, 2007, 06:33:48 PM »

cool cool it worked.

I successfully ripped your blog Smiley

Right on dood!

Wait, I mean, fuck off!!  ROFLMAO ROFLMAO
Logged

vsloathe
vim ftw!
Global Moderator
Lifer
*****
Offline Offline

Posts: 1669



View Profile
« Reply #8 on: November 13, 2007, 06:48:20 AM »

Awesome, it handles redirects. Also Perk how about the issue I was having with HTTPS? Propeller is a bitch that way.
Logged

hai
perkiset
« Reply #9 on: November 13, 2007, 07:51:01 AM »

thanks VS... gonna see about HTTPS soon, as well as auto-cookies and pipelining, perhaps today if I have time.
Logged

perkiset
« Reply #10 on: November 13, 2007, 12:03:21 PM »

... handles redirects ...
Well, it REPORTS redirects, it's up to you to go to them.

... with HTTPS?
See the next post...
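On the redirect point: since the class only reports the redirect, following it is a small loop on the caller's side. A minimal sketch, assuming a fetch callback that returns the result code, the redirect target and the content. The names fetchFollowingRedirects and perkFetch are made up here; perkFetch only shows how the callback might be wired to the resultCode and redirect properties described earlier.

```php
<?php
// Hypothetical helper: follow 301/302 redirects up to $maxHops times.
// $fetch is any callable returning array('code' => int, 'redirect' => string, 'content' => string).
function fetchFollowingRedirects($fetch, $url, $maxHops = 5)
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $res = call_user_func($fetch, $url);
        if (($res['code'] == 301 || $res['code'] == 302) && $res['redirect'] != '') {
            $url = $res['redirect'];    // go around again with the new target
            continue;
        }
        $res['url'] = $url;             // where we finally landed
        return $res;
    }
    return false;                       // too many hops, probably a redirect loop
}

// Possible adapter for the class in this thread (assumes webRequest2 is already loaded).
function perkFetch($url)
{
    $req = new webRequest2();
    $content = $req->simpleGet($url);
    return array('code' => (int)$req->resultCode, 'redirect' => $req->redirect, 'content' => $content);
}
```

With the class loaded, fetchFollowingRedirects('perkFetch', 'http://google.com/') would chase the 301 shown in the debug log above. The sketch assumes the Location value is a complete URL; a relative Location would need resolving first.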
Logged

perkiset
« Reply #11 on: November 13, 2007, 12:59:18 PM »

Updates
  • HTTPS now supported in both normal mode and the simpleGet function. Either set the property useSSL to true (it's false by default) when using dispatch()  (you'll also need to set the port to 443) or simply pass a secure URL in the simpleGet function ie., simpleGet('https://www.adomain.com/afile.html') and the class will set useSSL and the port correctly.
  • Auto-cookies: Whenever a URL is retrieved and there are cookies in the header, they are automatically parsed into the cookies array - so if you went right back out to the same domain the cookies would be sent up automatically for you, like a surfer's browser would.
  • getLinksAll() - this function will return all links on the scraped page in their on-page format ie., no formatting or modification is done to them.
  • getLinksExternal() - this function will return all links on the scraped page that point to a domain other than what was requested in the URL.
  • getLinksInternal() - this function gets all the internal links on the scraped page, and does a bit of work to them. If they are secure links, they are unchanged. If they have a different port number, they are unchanged. If they are a simple hard link (ie., http://www.thedomain.com/index.html) then the resulting link will be /index.html. If they are a relative URL, ie., "articles/myarticle.html" then the directory structure of the called URL will be used to rebuild an absolute link from the relative link. So for example, if the last page called was '/dir1/dir2/index.html' and the relative link is 'articles/index.html' then the resulting absolute link returned would be '/dir1/dir2/articles/index.html'.
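The getLinksInternal() rules above can be sketched as one standalone function, which is handy for testing the rules without hitting a server. This is an illustration of the described behavior, not the class's own code; normalizeInternalLink and its parameters are made-up names.

```php
<?php
// Hypothetical standalone helper mirroring the getLinksInternal() rules described above.
// $domain is the requested host, $pageUrl the path of the page the link was found on.
function normalizeInternalLink($domain, $pageUrl, $link)
{
    // Secure links and links with an explicit port are returned untouched
    if (preg_match('~^https://~i', $link)) { return $link; }
    if (preg_match('~^http://[^/]+:[0-9]+(/|$)~i', $link)) { return $link; }

    // Hard link to the same domain: strip the scheme and host
    if (preg_match('~^http://' . preg_quote($domain, '~') . '(/.*)?$~i', $link, $m)) {
        return (isset($m[1]) && $m[1] != '') ? $m[1] : '/';
    }

    // Already an absolute path on this host
    if (substr($link, 0, 1) == '/') { return $link; }

    // Relative link: prepend the directory of the page it was found on
    $dir = '/';
    if (preg_match('~^(/[^?#]*/)~', $pageUrl, $m)) { $dir = $m[1]; }
    return $dir . $link;
}
```

With the example from the list, normalizeInternalLink('www.thedomain.com', '/dir1/dir2/index.html', 'articles/index.html') returns '/dir1/dir2/articles/index.html'. Links with '../' segments are not collapsed in this sketch.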

Here is the current class code:
Code:
<?php

class webRequest2
{
    private $socket;

    protected $finalURL;
    protected $rawContent;
    protected $rawHeader;
    protected $rawResponse;

    protected $chunkedLength;
    protected $chunkedTransfer;
    protected $cookies;
    protected $cookieStr;
    protected $errorFlag;
    protected $getList;
    protected $headers;
    protected $postList;
    protected $postStr;

    public $accept;
    public $charSet;
    public $domain;
    public $debugLogFile;
    public $debugLogClearOnDispatch;
    public $debugMode;
    public $language;
    public $manualPostContent;
    public $method;
    public $port;
    public $redirect;
    public $resultCode;
    public $timeout;
    public $url;
    public $userAgent;
    public $useSSL;

    // Protected and special functions

    // Old-style constructor (same name as the class). PHP 8 no longer treats this
    // as a constructor, so it would need to be named __construct() there.
    function webRequest2()
    {
        $this->reset();
        preg_match('/^([0-9])/', phpversion(), $parts);
        $this->ancient = ($parts[1] < '5');
        $this->userAgent = 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8';
        $this->accept = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
        $this->charSet = 'ISO-8859-1,utf-8:q=0.7,*;q=0.7';
        $this->language = 'en-us,en;q=0.5';

        if (!defined('WRD_OFF'))
        {
            define('WRD_OFF', 0);
            define('WRD_ECHO', 1);
            define('WRD_LOG', 2);
        }

        $this->debugMode = WRD_OFF;
        $this->debugLogFile = '';
        $this->debugLogClearOnDispatch = true;
        $this->timeout = 30;
        $this->useSSL = false;
    }

    protected function buildCookieStr()
    {
        $cookieStr = '';
        $start = true;
        foreach($this->cookies as $name=>$value)
        {
            if (!$start) { $cookieStr .= '; '; }
            $cookieStr .= "$name=$value";
            $start = false;
        }
        $this->debug("Built COOKIE String: $cookieStr");
        return $cookieStr;
    }

    protected function buildGetStr()
    {
        $getStr = '';
        $getCount = count($this->getList);
        if ($getCount)
        {
            $sepStr = '?';
            foreach($this->getList as $name=>$value)
            {
                $value = urlencode($value);
                $getStr .= "$sepStr$name=$value";
                $sepStr = '&';
            }
        }
        $this->debug("Built GET String: $getStr");
        return $getStr;
    }

    protected function buildPostStr()
    {
        if ($this->manualPostContent)
            return $this->manualPostContent;

        $postStr = '';
        $postCount = count($this->postList);
        if ($postCount)
        {
            $sepStr = '';
            foreach($this->postList as $name=>$value)
            {
                $value = urlencode($value);
                $postStr .= "$sepStr$name=$value";
                $sepStr = '&';
            }
        } else {
            $postStr = 'No Content';
        }
        $this->debug("Built POST String: $postStr");
        return $postStr;
    }

    protected function buildHeader()
    {
        if ($this->method == 'GET')
        {
            $header = "GET {$this->finalURL} HTTP/1.1\r\n";
            $header .= "Host: {$this->domain}\r\n";
            $header .= "User-Agent: {$this->userAgent}\r\n";
            $header .= "Accept: {$this->accept}\r\n";
            $header .= "Accept-Language: {$this->language}\r\n";
            $header .= "Accept-Encoding: \r\n";
            $header .= "Accept-Charset: {$this->charSet}\r\n";
            if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
            // The blank line after Connection ends the header block, so the two
            // lines that follow are sent after it rather than as headers.
            $header .= "Connection: close\r\n\r\n";
            $header .= "Content-Type: text/html\r\n";
            $header .= "Content-Length: 0\r\n";
        } else {
            $postData = $this->buildPostStr();
            $requestLen = strlen($postData);
            $header = "POST {$this->finalURL} HTTP/1.1\r\n";
            $header .= "Host: {$this->domain}\r\n";
            $header .= "User-Agent: {$this->userAgent}\r\n";
            $header .= "Accept: {$this->accept}\r\n";
            $header .= "Accept-Language: {$this->language}\r\n";
            $header .= "Accept-Encoding: \r\n";
            $header .= "Accept-Charset: {$this->charSet}\r\n";
            // Braces are needed so the method call is interpolated into the string
            if ($this->hasCookies()) { $header .= "Cookie: {$this->buildCookieStr()}\r\n"; }
            $header .= "Connection: close\r\n";
            $header .= "Content-Type: application/x-www-form-urlencoded\r\n";
            $header .= "Content-Length: $requestLen\r\n\r\n";
            $header .= "$postData\r\n";
        }
        $this->debug("Outbound Header:\n$header");
        return $header;
    }

    protected function buildURL()
    {
        $this->finalURL = "{$this->url}{$this->buildGetStr()}";
        $this->debug("FinalURL: {$this->finalURL}");
    }

    protected function clearDebugLog()
    {
        if ($this->debugMode == WRD_LOG)
        {
            if (!$this->debugLogFile)
                throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
            if (file_put_contents($this->debugLogFile, '') === false)
                throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
        }
    }

    protected function debug($msg)
    {
        switch($this->debugMode)
        {
            case WRD_OFF:
                return;

            case WRD_ECHO:
                echo "$msg\n";
                break;

            case WRD_LOG:
                if (!$this->debugLogFile)
                    throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile property not set');
                if (file_put_contents($this->debugLogFile, "$msg\n\n", FILE_APPEND) === false)
                    throw new Exception('webRequest2: Debug mode set to LOG, but debugLogFile cannot be written to');
        }
    }

    protected function execute($theHeader)
    {
        $this->beforeExecute();

        $this->debug('Execute: Starts');

        $this->clearHeaders();
        $this->rawResponse = '';
        $this->rawHeader = '';
        $this->rawContent = '';
        $this->chunkedTransfer = false;
        $this->chunkedLength = 0;
        $this->errorFlag = false;
        $this->resultCode = 0;
        $this->redirect = '';

        $sslStr = ($this->useSSL) ? 'ssl://' : '';
        $this->socket = @fsockopen("$sslStr{$this->domain}", $this->port, $errno, $errstr);
        if (!$this->socket)
        {
            $this->debug('Execute: Cannot open socket');
            $this->onFailure();
            $this->afterExecute();
            return false;
        }

        $this->debug('Execute: Sending request');
        $bytesToSend = strlen(trim($theHeader));
        if (($bytesSent = fwrite($this->socket, trim($theHeader))) <> $bytesToSend)
        {
            $this->debug("Execute: Failed - Only $bytesSent of $bytesToSend sent");
            $this->onFailure();
            $this->afterExecute();
            return false;
        }
        $this->debug('Execute: Request Sent');

        $this->rawResponse = $this->getChunk();

        // Everything before the first blank line is the header...
        preg_match('/^(.*)\r\n\r\n/smU', $this->rawResponse, $parts);
        $this->rawHeader = $parts[1];

        // ...and everything after it is the content
        preg_match('/\r\n\r\n(.*)/sm', $this->rawResponse, $parts);
        $this->rawContent = $parts[1];

        $this->processHeaders();
        $this->reactToResultCode();

        if ($this->chunkedTransfer)
        {
            $receivedSoFar = strlen($this->rawContent);
            while ($receivedSoFar < $this->chunkedLength)
            {
                $this->debug('Execute: Getting Chunked Block');
                $this->rawContent .= $this->getChunk();
                $receivedSoFar = strlen($this->rawContent);

                if ($this->errorFlag)
                {
                    $this->debug('Execute: Terminating');
                    $this->onFailure();
                    $this->afterExecute();
                    return false;
                }
            }
        }

        $this->debug('Execute: Successful Retrieve');
        $this->debug('postProcess: Content length is ' . strlen($this->rawContent));
        $this->onSuccess();
        $this->afterExecute();

        $this->debug('Execute: Completes');
        return $this->resultCode;
    }

    protected function getChunk()
    {
        $packets = array();
        $this->debug('GetChunk: Starts');
        while (!feof($this->socket))
        {
            if (!$this->ancient) { stream_set_timeout($this->socket, $this->timeout); }

            $thisBuff = fread($this->socket, 65535);

            if (!$this->ancient)
            {
                $info = stream_get_meta_data($this->socket);
                if ($info['timed_out'])
                {
                    $this->debug('getChunk: Timed Out');
                    $this->errorFlag = true;
                    break;
                }
            }

            $this->debug('GetChunk: Received ' . strlen($thisBuff));
            $packets[] = $thisBuff;
        }
        return implode('', $packets);
    }

    protected function hasCookies() { return count($this->cookies); }

    protected function processHeaders()
    {
        $this->debug('processHeaders: Starts');
        $tempArr = explode("\r\n", $this->rawHeader);
        $ptr = 0;
        foreach($tempArr as $line)
        {
            if ($ptr == 0)
            {
                // Zeroth line - not a valid header, get the result code:
                preg_match('/([0-9]{3})/', $line, $parts);
                $this->resultCode = $parts[1];
                $ptr++;
                $this->debug("processHeaders: Result Code is {$this->resultCode}");
            } else {
                // Split on the first ': ' only, so values containing ': ' stay whole
                $parts = explode(': ', $line, 2);
                $this->headers[$parts[0]] = isset($parts[1]) ? $parts[1] : '';
            }
        }

        $this->debug("processHeaders: Array Follows\n" . print_r($this->headers, true));

        $this->chunkedTransfer = preg_match('/: chunked/i', $this->rawHeader);
        if ($this->chunkedTransfer)
        {
            // OK - the actual content length is now going to be the first line of the content... grab it and ditch it...
            preg_match('/([^\r]+)\r\n(.*)/ms', $this->rawContent, $parts);
            $this->chunkedLength = hexdec($parts[1]);
            $this->rawContent = $parts[2];
            $this->debug("processHeaders: Chunked Transfer - expected length is {$this->chunkedLength}");
        }

        // If there are cookies, pull them into my cookie array...
        if (!empty($this->headers['Set-Cookie']))
        {
            $temp = explode(';', $this->headers['Set-Cookie']);
            foreach($temp as $line)
            {
                $parts = explode('=', $line);
                $name = trim($parts[0]);
                $value = urldecode(trim($parts[1]));
                $this->cookies[$name] = $value;
            }
            $this->debug("processHeaders: Cookie Array Follows\n" . print_r($this->cookies, true));
        }
    }

    protected function reactToResultCode()
    {
        // This function should be extended in the future to handle more eventualities
        switch($this->resultCode)
        {
            case 200:
                break;

            case 301:
            case 302:
                $this->redirect = $this->headers['Location'];
                $this->debug("reactToResultCode: Redirect To {$this->redirect}");
                break;

            case 404:
                break;

        }
    }



    // Protected functions, designed to be overridden:
    protected function afterExecute()
    {
        $this->debug('Default afterExecute()');
        // Remember to call this function if you override it...
        if ($this->socket)
            fclose($this->socket);
    }

    protected function beforeExecute()
    {
        $this->debug('Default beforeExecute()');
    }

    protected function onFailure()
    {
        $this->debug('Default onFailure()');
    }

    protected function onSuccess()
    {
        $this->debug('Default onSuccess()');
    }


    // Public functions
    function addGetParam($varName, $varValue)
    {
        $this->getList[trim($varName)] = $varValue;
        $this->debug("Adding GET Param: [$varName] = [$varValue]");
    }

    function addPostParam($varName, $varValue)
    {
        $this->postList[trim($varName)] = $varValue;
        $this->debug("Adding POST Param: [$varName] = [$varValue]");
    }

    function clearCookies() { $this->cookies = array(); }

    function clearGetParams() { $this->getList = array(); }

    function clearHeaders() { $this->headers = array(); }

    function clearPostParams() { $this->postList = array(); }

    function dispatch()
    {
        if ($this->debugLogClearOnDispatch) { $this->clearDebugLog(); }

        $this->debug('Dispatch Starts');
        $this->debug("Method: {$this->method}");

        $this->buildURL();
        $req = $this->buildHeader();
        return $this->execute($req);
    }

    function getContent() { return $this->rawContent; }

    function getCookie($cookieName) { return $this->cookies[$cookieName]; }

    function getCookies() { return $this->cookies; }

    function getHeader($headerName) { return $this->headers[$headerName]; }

    function getHeaders() { return $this->headers; }

    function getLinksAll()
    {
        // The heredoc body and its closing marker stay at column 0 for older PHP versions
        $regex = <<<REGEX
~<[\s]*a[\s]+href[\s]*=[\s]*['"]([^'"]*)~i
REGEX;
        preg_match_all($regex, $this->getContent(), $matches);
        return $matches[1];
    }

    function getLinksExternal()
    {
        $out = array();
        $temp = $this->getLinksAll();
        foreach($temp as $link)
        {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (!preg_match("~^http[s]*://{$this->domain}~", $link) and (preg_match('/^http/', $link)))
                $out[] = $link;
        }
        return $out;
    }

    function getLinksInternal()
    {
        $out = array();
        $temp = $this->getLinksAll();
        foreach($temp as $link)
        {
            // If a link has http://{this->domain} in it or NO domain, then it is local...
            if (preg_match("~^http[s]*://{$this->domain}~", $link) or (!preg_match('/^http/', $link)))
            {
                // OK - it is an internal link, but let's do a little bit to it to help out the caller...

                // Kill simple on-page positioning links. This runs before the relative-link
                // step, which would otherwise turn '#top' into '/#top'.
                if (substr($link, 0, 1) == '#')
                    continue;

                // If the URL has http:// and no port, then kill the front end of it:
                if (substr(strtolower($link), 0, 5) == 'http:')
                {
                    if (!preg_match('~:[0-9]{1,6}/~', $link))
                    {
                        preg_match("~{$this->domain}(.*)~", $link, $matches);
                        $link = $matches[1];
                    }
                }

                // If it is a relative link, then we need to take the current directory and add it
                // on to the front end so that the link is absolute. Links that still carry a
                // scheme (secure links, links with a port) are left unchanged.
                if (substr($link, 0, 1) <> '/' and !preg_match('/^http/i', $link))
                {
                    // Get the current directory from the URL that was requested...
                    preg_match('~^(/[^?]*/)~', $this->url, $matches);
                    if (!$matches)
                    {
                        // There was no directory - the last send was the root, and the URL is
                        // relative to the root...
                        $link = "/$link";
                    } else {
                        $link = "{$matches[1]}$link";
                    }
                }

                $out[] = $link;
            }
        }
        return $out;
    }

    function getRawHeader() { return $this->rawHeader; }
    function getRawResponse() { return $this->rawResponse; }

    function reset()
    {
        $this->rawPostData = '';
        $this->url = '';
        $this->domain = '';
        $this->port = 80;
        $this->method = 'GET';
        $this->getArray['__count'] = -1;
        $this->postArray['__count'] = -1;
        $this->rawResponse = '';
        $this->rawHeader = '';
        $this->rawContent = '';

        $this->cookieStr = '';
        $this->postStr = '';
        $this->manualPostContent = '';

        $this->clearCookies();
        $this->clearGetParams();
        $this->clearHeaders();
        $this->clearPostParams();
    }

    function setCookie($cookieName, $cookieValue) { $this->cookies[$cookieName] = $cookieValue; }

    function simpleGet($completeURL)
    {
        $this->debug("simpleGet: Starts with [$completeURL]");

        if (substr(strtolower($completeURL), 0, 5) == 'https')
        {
            $this->debug("simpleGet: Using SSL");
            $this->useSSL = true;
            $this->port = 443;
            preg_match('~//(.*)~', $completeURL, $parts);
            $completeURL = $parts[1];
        } else if (substr(strtolower($completeURL), 0, 4) == 'http') {
            $this->useSSL = false;
            $this->port = 80;
            preg_match('~//(.*)~', $completeURL, $parts);
            $completeURL = $parts[1];
        }

        // Split "domain/path..." into the host and everything after it
        preg_match('~([^/]+)(.*)~', $completeURL, $parts);
        $this->domain = $parts[1];
        // A URL with no path at all (ie., http://google.com) is requested as "/"
        $this->url = ($parts[2] == '') ? '/' : $parts[2];
        $this->finalURL = $this->url;

        $req = $this->buildHeader();
        if ($this->execute($req)) { return $this->rawContent; }

        $this->debug('simpleGet Failed');
        return false;
    }
}

?>

« Last Edit: November 14, 2007, 06:33:32 PM by perkiset » Logged

emonk
Rookie
**
Offline Offline

Posts: 44


View Profile
« Reply #12 on: November 13, 2007, 02:34:03 PM »

 Wow mayne. That's some pretty code. Almost makes me want to move over to the OO darkside....

 One suggestion that may or may not be appropriate would be a user agent randomizer to help prevent bannination on those long runs, but it might be better off this way as it's more generic.

 A sneaky way to do it if you wanted to would be to use IE Agents that look to have been infected with a zillion kinds of spyware by sticking some random gibberish into the UA where bonzai buddy and FunWeb usually like to hang out.
Logged
nutballs
« Reply #13 on: November 13, 2007, 03:32:26 PM »

i think perk is just assuming this is a base class that handles the core functionalities of requesting a page from the tubes.

I know that I, for one, intend to add in a whole bunch of extra stuff, like an agent randomizer for example. Since perk has already parameterized everything anyway, you can do this simply by writing your own random agent generator and then doing $req->userAgent = randomagentmaker(); before you simpleGet or dispatch.
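A rough sketch of that idea, kept outside the class as suggested. The function name pickRandomAgent and the agent list are illustrative assumptions only; the first string is the class default quoted earlier in the thread and the other two are sample browser strings of the era.

```php
<?php
// Hypothetical helper, not part of webRequest2: pick one user agent at random.
function pickRandomAgent($agents)
{
    if (count($agents) == 0) { return ''; }   // nothing to choose from
    return $agents[array_rand($agents)];
}

// Sample list for illustration only; swap in whatever agents suit the run.
$agents = array(
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/417.9 (KHTML, like Gecko) Safari/417.8',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
);

$chosenAgent = pickRandomAgent($agents);
// With the class loaded: $req->userAgent = $chosenAgent; before simpleGet() or dispatch()
```

Picking the agent once per run, rather than per request, keeps a long crawl looking like one consistent visitor; which is better depends on what you are trying to avoid.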
Logged

DangerMouse
Expert
****
Offline Offline

Posts: 244



View Profile
« Reply #14 on: November 13, 2007, 04:02:47 PM »

Some nice additions there!

All the help I'm getting from this place must be paying off, as when I replicated the previous webRequest class using cURL instead I've taken a similar approach!

With regard to the splitting of internal and external links, I appreciate that it is unlikely, but is there the potential that the base href tag may be set to an external domain, making relative links external? I've gone round in circles with this one when creating a little spider class recently, the use of regex here makes it look so easy  D'oh! - back to the drawing board for me.
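One way to check for that case is to look for a base tag before classifying relative links. This is a minimal sketch using the same regex-on-HTML approach as the class; getBaseHref and baseIsExternal are made-up names and nothing like them exists in webRequest2.

```php
<?php
// Hypothetical helper: pull the href out of a <base> tag, or return '' if there is none.
function getBaseHref($html)
{
    if (preg_match('~<base\b[^>]*\bhref\s*=\s*[\'"]([^\'"]*)[\'"]~i', $html, $m)) {
        return $m[1];
    }
    return '';
}

// Hypothetical helper: true when the page's <base> points at a host other than $domain.
function baseIsExternal($html, $domain)
{
    $base = getBaseHref($html);
    if ($base == '') { return false; }            // no base tag, relative links stay local
    $host = parse_url($base, PHP_URL_HOST);
    if ($host === null || $host === false) { return false; }   // base has no host part
    return (strcasecmp($host, $domain) != 0);
}
```

If baseIsExternal() comes back true, the relative links on that page would belong in the external list rather than the internal one.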

DM
Logged