DangerMouse

Hey all,

Knocked together a Google SERP scraper recently, more as a learning exercise for myself than anything else - I've tried to use PHP 5 OOP techniques wherever possible rather than the traditional regex approach to scraping.

It's nothing special, but as I've got a lot out of using these boards I thought I'd post it up. It's fairly bloated, but the implementation of the SeekableIterator maybe provides a nice working example of its scope in OO PHP (credit to Zend for my inspiration here).
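
For anyone who hasn't met SeekableIterator before, here is a minimal sketch of what the interface gives you, using PHP's built-in ArrayIterator (which implements it) rather than the classes from this post. The variable names and sample values are made up for illustration.

```php
<?php
// ArrayIterator is a built-in SPL class that implements SeekableIterator.
$ranks = new ArrayIterator(array('first', 'second', 'third'));

// foreach drives rewind(), valid(), current(), key() and next() for you.
foreach ($ranks as $index => $label) {
    echo $index, ': ', $label, "\n";
}

// seek() jumps straight to a position instead of stepping through the rest.
$ranks->seek(2);
echo $ranks->current(), "\n";

// Seeking to a position that does not exist throws OutOfBoundsException.
try {
    $ranks->seek(10);
} catch (OutOfBoundsException $e) {
    echo 'Position 10 is out of range', "\n";
}
```

The result set class further down follows the same contract, only backed by a DOMNodeList instead of an array.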

I got bored adding validators for the search parameters; if I ever get round to adding them in I'll update this post.

To get this to work, the 'sendSearch' method in the Google_Serp_Scraper class needs to be completed - it originally relied on my own web request class, and I figured as everyone has their own preferred way of doing this I'd leave it blank here. The method must return a string containing the HTML obtained.
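
If you don't have a request class of your own to hand, one possible way to fill the gap is sketched below. It is only an assumption about how you might do it: fetchHtml is a hypothetical helper that is not part of the original code, it uses plain file_get_contents with a stream context, and the timeout and header values are placeholders.

```php
<?php
// Hypothetical helper, not part of the original class: fetches $url and
// returns the response body as a string, or throws if the request fails.
function fetchHtml($url, $timeoutSeconds = 10)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
            // Assumed header; send whatever your own request code normally sends.
            'header'  => "Accept: text/html\r\n",
        ),
    ));

    // @ suppresses the PHP warning so the failure is reported as an exception.
    $html = @file_get_contents($url, false, $context);

    if ($html === false) {
        throw new Exception('Web request to ' . $url . ' failed.');
    }

    return $html;
}
```

Inside sendSearch you could then return fetchHtml($url) in place of the blank section. A cURL-based version would work just as well if you need proxies or cookies.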

Feedback is much appreciated, especially in line with my goals of learning to use the PHP DOM functionality and implementing fully OO PHP. To this end I do think that the base class that launches the search requests should probably just launch a single request, and the loop should live outside of this in order for it to be semantically correct. The parsing functions would then move to the result object, minimising processing until required. However I wanted to encapsulate the 'search request' in one class and thought that the approach was already overly complex as it stood without making it worse! I'd be interested in any thoughts on improving this situation though, particularly if they lead to a streamlined approach.
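
To make the "parse only when required" idea a bit more concrete, here is a rough sketch of what a lazy result object could look like. It is not the code posted below: Lazy_Serp_Result and its parse methods are hypothetical names, and the HTML snippet is made up to match the shape the scraper's XPath expects.

```php
<?php
// Hypothetical lazy result: keeps the raw DOM node and only extracts a
// field the first time that field is read.
class Lazy_Serp_Result
{
    protected $node;
    protected $cache = array();

    public function __construct(DOMElement $node)
    {
        $this->node = $node;
    }

    // Called for undeclared properties such as $result->title.
    public function __get($name)
    {
        if (!array_key_exists($name, $this->cache)) {
            $method = 'parse' . ucfirst($name);
            if (!method_exists($this, $method)) {
                throw new Exception("Unknown result field '$name'");
            }
            $this->cache[$name] = $this->$method();
        }
        return $this->cache[$name];
    }

    protected function parseTitle()
    {
        $h2 = $this->node->getElementsByTagName('h2')->item(0);
        return $h2 ? trim($h2->nodeValue) : '';
    }

    protected function parseUrl()
    {
        $h2 = $this->node->getElementsByTagName('h2')->item(0);
        $a  = $h2 ? $h2->getElementsByTagName('a')->item(0) : null;
        return $a ? $a->getAttribute('href') : '';
    }
}

// Made-up markup, only here so the sketch can run on its own.
$html = '<div id="res"><div class="g"><h2><a href="http://example.com/">Example</a></h2></div></div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//div[@id="res"]//div[@class="g"]') as $node) {
    $result = new Lazy_Serp_Result($node);
    echo $result->title, ' -> ', $result->url, "\n";
}
```

With something like this the request class would only need to fetch one page and hand back the raw nodes, and the paging loop could live in the calling code.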

Cheers,

DM

Sample Usage:


<?php
// Test Run of Google Scraper

require_once('google_serp_scraper.php');

try
{
$scraper = new Google_Serp_Scraper('http://www.google.com/');
$results = $scraper->search('lion', 20);

echo "Total Results Returned: " . $results->totalResults() . "<br />";
echo "Estimated Total Results: {$results->totalResultsAvailable}<br /><br />";

foreach($results as $result)
{
echo "Rank: {$result->position} <br />";
echo "Title: {$result->title} <br />";
echo "URL: {$result->url} <br />";
echo "Summary: {$result->summary} <br />";
echo "CacheUrl: {$result->cacheUrl} <br /><br />";
}
}
catch(Exception $e)
{
echo 'Caught exception: ',  $e->getMessage(), " In File: ", $e->getFile(), " Trace: ", $e->getTraceAsString(), " ";
}

?>


Class to follow.

DangerMouse

Launch Class:


<?php
/* -------------------------------------------------------------------------
*
* Google SERP Scraper
*
* - Purpose: Launches requests and collects responses
*
* - Usage: Create new object with google domain to scrape and
*     any changes to valid search string parameters.
*
*     Pass query and any additional options to 'search' method.
*
* - Returns: Google_Serp_ResultSet Object
*
* - Author: gatecrasher1981@gmail.com
*
* Any feedback welcome.
*
* --------------------------------------------------------------------------*/

Class Google_Serp_Scraper
{

// -----------------
// Properties
// -----------------

// Public
public $baseDomain;
public $validOptions;

// Protected
protected $resultsTally;
protected $results;

// -----------------
// Constructor
// -----------------
public function __construct($domain, array $options = array())
{
// Valid Search Parameters :: Format: $key = search param; $value = friendly name
$validOptions = array(
'hl' => 'interfaceLanguage', // Validate
'btnG' => 'btnG',
'num' => 'results',
'oe' => 'outputEncoding', // Validate
'ie' => 'inputEncoding', // Validate
'qdr' => 'dateFilter',
'lr' => 'language', // Validate
'cr' => 'country', // Validate
'safe' => 'safeFilter', // Validate
'filter' => 'duplicateFilter', // Validate
'start' => 'start'
);

$this->validOptions = array_merge($validOptions, $options);

$this->validateDomain($domain);

$this->results = $this->createResultsContainer();
}

// -----------------
// Methods
// -----------------

// -----------------
// Search Query
// -----------------
public function search($query, $requestedResults = 100, $options = array())
{
// Set default options - minimum options required to get search to run
$defaultOptions = array(
'interfaceLanguage' => 'en',
'btnG' => 'Google Search'
);

$options = array_merge($defaultOptions, $options);

if(empty($query))
{
throw new exception('Query string must not be empty!');
}

$this->validateOptions($options);

$pagesRequired = $this->getPagesRequired($requestedResults, $options);

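// Results per page defaults to 10 when the 'results' option is not set
$resultsPerPage = empty($options['results']) ? 10 : $options['results'];

$pagesReceived = 0;
$expectedResultsTally = 0;

// Stop once enough pages are fetched, or when a page comes back short of results
while($pagesReceived < $pagesRequired && $this->resultsTally >= $expectedResultsTally)
{
if($pagesReceived > 0)
{
// 'start' is the zero-based offset of the first result on the page
$options['start'] = $resultsPerPage * $pagesReceived;
}

$queryString = $this->buildSearchString($query, $options);
$resultPage = $this->sendSearch($queryString);

$this->processResultsPage($resultPage);

sleep(rand(5,15));

$pagesReceived++;
$expectedResultsTally = $pagesReceived * $resultsPerPage;
}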

require_once('../libs/GoogleSerpScraper/google_serp_resultset.php');
return new Google_Serp_ResultSet($this->results);
}

// -----------------
// Get Number of Pages Required
// -----------------
protected function getPagesRequired($requestedResults, array $options)
{
if(empty($options['results']))
{
return ceil($requestedResults / 10);
}

return ceil($requestedResults / $options['results']);
}

// -----------------
// Validate Options
// -----------------
protected function validateOptions(array $options)
{
// Check there are no invalid options passed
$difference = array_diff(array_keys($options), $this->validOptions);

if($difference)
{
throw new exception('Invalid option keys were passed');
}

// Validate number of results requested per page
if( isset($options['results']) && ($options['results'] < 1 || $options['results'] > 100) )
{
throw new exception('Number of results per page must be between 1 - 100');
}

// Validate date option if set
if( isset($options['dateFilter']) && preg_match('/^(d|m|y)[0-9]+$/', $options['dateFilter']) == 0 )
{
throw new exception('Date Filter Option must be expressed as either d, m or y, followed by a number');
}
}

// -----------------
// Validate Domain
// -----------------
protected function validateDomain($domain)
{
// Sloppy link check, apply external link object validation where available
if(empty($domain) || !stristr($domain, 'google'))
{
throw new exception('A valid google domain to search must be supplied.');
}

$this->baseDomain = $domain;
}

// -----------------
// Process Results Page
// -----------------
protected function processResultsPage($results)
{
$resultPage = new DOMDocument();

if(!@$resultPage->loadHTML($results))
{
throw new exception('Failed to load HTML from result page into DOM object') ;
}

$xpath = new DOMXpath($resultPage);

// Set estimated total results
$this->results->getElementsByTagName('EstimatedTotalResults')->item(0)->nodeValue = $this->parseEstimatedTotalResults($xpath);

// Isolate results
$results = $xpath->query('//div[@id="res"]//div[@class="g"]');

// Parse out each result
foreach($results as $result)
{
$resultNode = $this->results->createElement('Result');

$resultNode->appendChild( $this->parseTitle($result) );
$resultNode->appendChild( $this->parseLink($result) );
$resultNode->appendChild( $this->parseSummaryText($result, $xpath) );
$resultNode->appendChild( $this->parseCacheLink($result, $xpath) );

$this->results->getElementsByTagName('ResultSet')->item(0)->appendChild( $resultNode );
}

// Count the individual Result nodes collected so far (there is only ever one ResultSet node)
$this->resultsTally = $this->results->getElementsByTagName('Result')->length;
}

// -----------------
// Parse Estimated Total Results
// -----------------
protected function parseEstimatedTotalResults(DOMXPath $xpath)
{
$estimatedTotalResults = $xpath->query('//table[@class="t bt"]//font//b[3]');

return $estimatedTotalResults->item(0)->nodeValue;
}

// -----------------
// Parse Title
// -----------------
protected function parseTitle(DOMNode $result)
{
$title = htmlentities($result->getElementsByTagName('h2')->item(0)->nodeValue);

return new DOMElement('Title', $title);
}

// -----------------
// Parse Link
// -----------------
protected function parseLink(DOMNode $result)
{
$url = htmlentities($result->getElementsByTagName('h2')->item(0)->getElementsByTagName('a')->item(0)->attributes->getNamedItem('href')->nodeValue);

return new DOMElement('URL', $url);
}

// -----------------
// Parse Summary
// -----------------
protected function parseSummaryText(DOMNode $result, DOMXPath $xpath)
{
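$summary = $xpath->query('.//font[@size = "-1"]', $result);

// Collect the text of nested links and spans so it can be stripped from the summary
$replaceArray = array();

// Both paths are relative to this result; a bare //span would match every span in the page
foreach($xpath->query('.//font[@size = "-1"]//a | .//span', $result) as $deletes)
{
$replaceArray[] = $deletes->nodeValue;
}

$summary = htmlentities(str_replace( $replaceArray, '', $summary->item(0)->nodeValue ));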

return new DOMElement('Summary', $summary);
}

// -----------------
// Parse Cache Link
// -----------------
protected function parseCacheLink(DOMNode $result, DOMXPath $xpath)
{
$cacheLinkResults = $xpath->query('table//nobr/a[. = "Cached"]/@href', $result);

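// Not every result has a cached copy, so fall back to an empty string
$cacheURL = $cacheLinkResults->length > 0 ? htmlentities($cacheLinkResults->item(0)->nodeValue) : '';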

return new DOMElement('CacheURL', $cacheURL);
}

// -----------------
// Build Search String
// -----------------
protected function buildSearchString($query, $options)
{
$params['q'] = (string) $query;

foreach($options as $optionKey => $optionValue)
{
$translateKey = array_search($optionKey, $this->validOptions);
$params[$translateKey] = $optionValue;
}

// URL encodes and glues together $params array
return http_build_query($params);
}

// -----------------
// Send Query
// -----------------
protected function sendSearch($queryString)
{
$url = $this->baseDomain . 'search?' . $queryString;

// --------------------------!
// INSERT WEB REQUEST TO $url HERE, assigning the HTML received to $html
// --------------------------!
$html = false;

// --------------------------!
// Check if the Web request was successful
// --------------------------!
if($html === false)
{
throw new exception('Web request to ' . $url . ' failed.');
}

// --------------------------!
// Return string containing HTML received
// --------------------------!
return $html;
}

// -----------------
// Create Results Container
// -----------------
protected function createResultsContainer()
{
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild( $dom->createElement('EstimatedTotalResults') );
$dom->appendChild( $dom->createElement('ResultSet') );

return $dom;
}

}

DangerMouse

Result Set Class, default name google_serp_resultset.php


<?php
/* -------------------------------------------------------------------------
*
* Google SERP Result Set
*
* - Purpose: Implements a Seekable Iterator to loop through
* collected search results
*
* - Usage: Construct takes DOMDocument returned from SERP Scrape. Each
* result in this object can then be accessed via standard loop
* patterns e.g. while, foreach.
*
* totalResults method returns number of results in object
*
* - Returns: Google_Serp_Result Object when iterated
*
* - Author: gatecrasher1981@gmail.com
*
* Any feedback welcome.
*
* --------------------------------------------------------------------------*/
require_once('google_serp_result.php');

Class Google_Serp_ResultSet implements SeekableIterator
{

// ----------------
// Define Properties
// ----------------

    public $totalResultsAvailable;
    public $totalResultsReturned;
   
protected $dom;
    protected $results;
    protected $currentIndex = 0;

// ------------------
// Public Methods
// ------------------

    // --------------------
    // Parse the search response and retrieve the results for iteration
    // --------------------
    public function __construct(DOMDocument $dom)
    { 
$this->dom = $dom;

$xpath = new DOMXPath($dom);
        $this->results = $xpath->query('//ResultSet//Result');
       
        $this->totalResultsAvailable = $dom->getElementsByTagName('EstimatedTotalResults')->item(0)->nodeValue;
        $this->totalResultsReturned = (int) $this->results->length;
    }

    // --------------------
    // Total Number of results returned
    // -------------------- 
    public function totalResults()
    {
        return $this->totalResultsReturned;
    }

// --------------------
// Implement SeekableIterator
// --------------------

// --------------------
// Implement SeekableIterator::current()
// --------------------
    public function current()
    {
        // Return an instance of result Object
return new Google_Serp_Result($this->results->item($this->currentIndex), $this->currentIndex);
    }


// --------------------
//Implement SeekableIterator::key()
// --------------------
    public function key()
    {
        return $this->currentIndex;
    }


// --------------------
// Implement SeekableIterator::next()
// --------------------
    public function next()
    {
        $this->currentIndex += 1;
    }

// --------------------
// Implement SeekableIterator::rewind()
// --------------------
    public function rewind()
    {
        $this->currentIndex = 0;
    }


// --------------------
// Implement SeekableIterator::seek()
// --------------------
    public function seek($index)
    {
        $indexInt = (int) $index;
       
        if ($indexInt >= 0 && $indexInt < $this->results->length)
{
            $this->currentIndex = $indexInt;
        }
else
{
            throw new OutOfBoundsException("Illegal index '$index'");
        }
    }

// --------------------
// Implement SeekableIterator::valid()
// --------------------
    public function valid()
    {
        return $this->currentIndex < $this->results->length;
    }
}

DangerMouse

Result Object - default name google_serp_result.php


<?php
/* -------------------------------------------------------------------------
*
* Google SERP Result Object
*
* - Purpose: Models a result object.
*
* - Usage: Takes DOMElement representing result on construct
*
*     Access properties to get result values
*
* - Returns: N/A
*
* - Author: gatecrasher1981@gmail.com
*
* Any feedback welcome.
*
* --------------------------------------------------------------------------*/

Class Google_Serp_Result
{

// -----------------
// Define Properties
// -----------------

    public $position;
    public $type;
public $title;
public $summary;
public $cacheUrl;
public $url;

    protected $resultElement;

// -----------------
// Constructor
// -----------------
    public function __construct(DOMElement $result, $position)
    {
        // Assign properties
        $this->position       = $position + 1;
        $this->title = $result->getElementsByTagName('Title')->item(0)->nodeValue;
        $this->url = $result->getElementsByTagName('URL')->item(0)->nodeValue;
        $this->summary     = $result->getElementsByTagName('Summary')->item(0)->nodeValue;
        $this->cacheUrl     = $result->getElementsByTagName('CacheURL')->item(0)->nodeValue;

$this->resultElement = $result;
    }
}

vsloathe

Awful lot of code there.

Thanks for the contribution, I will have to check it out when I get time later.

perkiset

As will I DM - I have an itch to redo my stuff, so this will be cool to go over how you tackled it.

Thanks muchly
/p

