The Cache: Technology Expert's Forum
 
Topic: OO Google SERP Scraper (Read 4510 times)
DangerMouse
« on: December 19, 2007, 09:57:23 AM »

Hey all,

Knocked together a Google SERP scraper recently, more as a learning exercise for myself than anything else - I've tried to use PHP5 OOP techniques wherever possible rather than the traditional regex approach to scraping.

It's nothing special, but as I've got a lot out of using these boards I thought I'd post it up. It's fairly bloated, but the implementation of the SeekableIterator maybe provides a nice working example of its scope in OO PHP (credit to Zend for my inspiration here).
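
For anyone who hasn't met SeekableIterator before, here is a stripped-down sketch of the idea, separate from the scraper itself. ArrayResultSet is a made-up class name backed by a plain array, and the return types assume PHP 8; the real result set class wraps a DOMNodeList instead.

```php
<?php
// Hypothetical array-backed iterator, for illustration only (assumes PHP 8).
class ArrayResultSet implements SeekableIterator
{
    private array $items;
    private int $index = 0;

    public function __construct(array $items)
    {
        // Re-index from 0 so positions and keys line up
        $this->items = array_values($items);
    }

    public function current(): mixed
    {
        return $this->items[$this->index];
    }

    public function key(): mixed
    {
        return $this->index;
    }

    public function next(): void
    {
        $this->index++;
    }

    public function rewind(): void
    {
        $this->index = 0;
    }

    public function valid(): bool
    {
        return $this->index >= 0 && $this->index < count($this->items);
    }

    // The one method SeekableIterator adds on top of Iterator
    public function seek(int $offset): void
    {
        if ($offset < 0 || $offset >= count($this->items)) {
            throw new OutOfBoundsException("Illegal index '$offset'");
        }

        $this->index = $offset;
    }
}

$set = new ArrayResultSet(array('first', 'second', 'third'));

// foreach drives rewind(), valid(), current(), key() and next()
foreach ($set as $position => $value) {
    echo $position, ': ', $value, "\n";
}

// seek() jumps straight to a position without looping
$set->seek(2);
echo $set->current(), "\n"; // prints "third"
```

The point of seek() is that a caller can jump to, say, result 15 directly rather than looping through the first 14.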

I got bored adding validators for the search parameters; if I ever get round to adding them in I'll update this post.

To get this to work, the 'sendSearch' method in the Google_Serp_Scraper class needs to be completed - it originally relied on my own web request class, and I figured as everyone has their own preferred way of doing this I'd leave it blank here. The method must return a string containing the HTML obtained.
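
As a rough idea of what could go in there, here is a minimal sketch using file_get_contents with a stream context. fetchHtml is a hypothetical helper name, and the 30 second timeout and the user agent string are assumptions rather than anything the class requires; it also needs allow_url_fopen enabled for http URLs. Swap in cURL or your own request class as you see fit.

```php
<?php
// Hypothetical helper, not part of the original class: returns the body
// found at $url as a string, or throws if the request fails.
function fetchHtml($url, $timeoutSeconds = 30)
{
    // Assumed settings; adjust to taste
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/0.1)',
        ),
    ));

    // @ suppresses the warning so that failure is reported via the exception
    $html = @file_get_contents($url, false, $context);

    if ($html === false) {
        throw new Exception('Web request to ' . $url . ' failed.');
    }

    return $html;
}
```

Inside sendSearch this would amount to returning fetchHtml($url), since the helper already throws on failure.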

Feedback is much appreciated, especially in line with my goals of learning to use the PHP DOM functionality and implementing fully OO PHP. To this end I do think that the base class that launches the search requests should probably just launch a single request, and the loop should live outside of it in order to be semantically correct. The parsing functions would then move to the result object, minimising processing until required. However I wanted to encapsulate the 'search request' in one class, and thought that the approach was already overly complex as it stood without making it worse! I'd be interested in any thoughts on improving this situation though, particularly if they lead to a streamlined approach.
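
To make the "loop lives outside" idea concrete, here is a small sketch under stated assumptions. collectResults and the $fetchPage callable are made-up names: $fetchPage stands in for an object that launches exactly one request and returns that page's results as an array, and 10 results per page is assumed as the default.

```php
<?php
// Hypothetical sketch: the paging loop sits outside whatever performs a
// single request. $fetchPage($start, $perPage) must return one page of
// results as an array.
function collectResults($fetchPage, $requested, $perPage = 10)
{
    $results = array();
    $pages   = (int) ceil($requested / $perPage);

    for ($page = 0; $page < $pages; $page++) {
        // Zero-based offset of the first result on this page
        $batch   = call_user_func($fetchPage, $page * $perPage, $perPage);
        $results = array_merge($results, $batch);

        // A short page means there is nothing more to fetch
        if (count($batch) < $perPage) {
            break;
        }
    }

    // Trim any overshoot from the last page
    return array_slice($results, 0, $requested);
}

// Stand-in for a real single-request object: serves slices of fake data
$fakeData  = range(1, 25);
$fetchPage = function ($start, $perPage) use ($fakeData) {
    return array_slice($fakeData, $start, $perPage);
};

echo count(collectResults($fetchPage, 20)), "\n"; // prints 20
```

With the loop pulled out like this, the request class only has to know how to fetch and parse one page, and the delay between requests could be handled by the caller.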

Cheers,

DM

Sample Usage:

Code:
<?php
// Test Run of Google Scraper

require_once('google_serp_scraper.php');

try
{
    $scraper = new Google_Serp_Scraper('http://www.google.com/');
    $results = $scraper->search('lion', 20);

    echo "Total Results Returned: " . $results->totalResults() . "<br />";
    echo "Estimated Total Results: {$results->totalResultsAvailable}<br /><br />";

    // The result set is a SeekableIterator, so foreach just works
    foreach($results as $result)
    {
        echo "Rank: {$result->position} <br />";
        echo "Title: {$result->title} <br />";
        echo "URL: {$result->url} <br />";
        echo "Summary: {$result->summary} <br />";
        echo "CacheUrl: {$result->cacheUrl} <br /><br />";
    }
}
catch(Exception $e)
{
    echo 'Caught exception: ', $e->getMessage(), "\n In File: ", $e->getFile(), "\n Trace: ", $e->getTraceAsString(), "\n";
}

?>

Class to follow.
DangerMouse
« Reply #1 on: December 19, 2007, 09:58:32 AM »

Launch Class:

Code:
<?php
/* -------------------------------------------------------------------------
 *
 * Google SERP Scraper
 *
 * - Purpose: Launches requests and collects responses
 *
 * - Usage:  Create new object with google domain to scrape and 
 *     any changes to valid search string parameters.
 *
 *     Pass query and any additional options to 'search' method.
 *
 * - Returns:  Google_Serp_ResultSet Object
 *
 * - Author: gatecrasher1981@gmail.com
 *
 * Any feedback welcome.
 *
 * --------------------------------------------------------------------------*/

class Google_Serp_Scraper
{

    // -----------------
    // Properties
    // -----------------

    // Public
    public $baseDomain;
    public $validOptions;

    // Protected
    protected $resultsTally = 0;
    protected $results;

    // -----------------
    // Constructor
    // -----------------
    public function __construct($domain, array $options = array())
    {
        // Valid Search Parameters :: Format: $key = search param; $value = friendly name
        $validOptions = array(
            'hl'     => 'interfaceLanguage', // Validate
            'btnG'   => 'btnG',
            'num'    => 'results',
            'oe'     => 'outputEncoding',    // Validate
            'ie'     => 'inputEncoding',     // Validate
            'qdr'    => 'dateFilter',
            'lr'     => 'language',          // Validate
            'cr'     => 'country',           // Validate
            'safe'   => 'safeFilter',        // Validate
            'filter' => 'duplicateFilter',   // Validate
            'start'  => 'start'
        );

        $this->validOptions = array_merge($validOptions, $options);

        $this->validateDomain($domain);

        $this->results = $this->createResultsContainer();
    }

    // -----------------
    // Methods
    // -----------------

    // -----------------
    // Search Query
    // -----------------
    public function search($query, $requestedResults = 100, $options = array())
    {
        // Set default options - minimum options required to get search to run
        $defaultOptions = array(
            'interfaceLanguage' => 'en',
            'btnG'              => 'Google Search'
        );

        $options = array_merge($defaultOptions, $options);

        if(empty($query))
        {
            throw new Exception('Query string must not be empty!');
        }

        $this->validateOptions($options);

        $pagesRequired = $this->getPagesRequired($requestedResults, $options);

        // 10 results per page unless 'results' says otherwise (same default as getPagesRequired)
        $resultsPerPage = empty($options['results']) ? 10 : $options['results'];

        $pagesReceived = 0;
        $expectedResultsTally = 0;

        // Stop once enough pages are in, or as soon as a page comes back short
        while($pagesReceived < $pagesRequired && $this->resultsTally >= $expectedResultsTally)
        {
            if($pagesReceived > 0)
            {
                // 'start' is the zero-based offset of the first result on the page
                $options['start'] = $resultsPerPage * $pagesReceived;
            }

            $queryString = $this->buildSearchString($query, $options);
            $resultPage = $this->sendSearch($queryString);

            $this->processResultsPage($resultPage);

            // Random pause between requests
            sleep(rand(5,15));

            $pagesReceived++;
            $expectedResultsTally = $pagesReceived * $resultsPerPage;
        }

        require_once('../libs/GoogleSerpScraper/google_serp_resultset.php');
        return new Google_Serp_ResultSet($this->results);
    }

    // -----------------
    // Get Number of Pages Required
    // -----------------
    protected function getPagesRequired($requestedResults, array $options)
    {
        // Google's default is 10 results per page
        if(empty($options['results']))
        {
            return ceil($requestedResults / 10);
        }

        return ceil($requestedResults / $options['results']);
    }

    // -----------------
    // Validate Options
    // -----------------
    protected function validateOptions(array $options)
    {
        // Check there are no invalid options passed (options are keyed by friendly name)
        $difference = array_diff(array_keys($options), $this->validOptions);

        if($difference)
        {
            throw new Exception('Invalid option keys were passed');
        }

        // Validate number of results requested per page
        if( isset($options['results']) && ($options['results'] < 1 || $options['results'] > 100) )
        {
            throw new Exception('Number of results per page must be between 1 - 100');
        }

        // Validate date option if set
        if( isset($options['dateFilter']) && preg_match('/^(d|m|y)[0-9]+$/', $options['dateFilter']) == 0 )
        {
            throw new Exception('Date Filter Option must be expressed as either d, m or y, followed by a number');
        }
    }

    // -----------------
    // Validate Domain
    // -----------------
    protected function validateDomain($domain)
    {
        // Sloppy link check, apply external link object validation where available
        if(empty($domain) || !stristr($domain, 'google'))
        {
            throw new Exception('A valid google domain to search must be supplied.');
        }

        $this->baseDomain = $domain;
    }

    // -----------------
    // Process Results Page
    // -----------------
    protected function processResultsPage($results)
    {
        $resultPage = new DOMDocument();

        // @ hides the parser warnings that real-world HTML triggers
        if(!@$resultPage->loadHTML($results))
        {
            throw new Exception('Failed to load HTML from result page into DOM object');
        }

        $xpath = new DOMXPath($resultPage);

        // Set estimated total results
        $this->results->getElementsByTagName('EstimatedTotalResults')->item(0)->nodeValue = $this->parseEstimatedTotalResults($xpath);

        // Isolate results
        $results = $xpath->query('//div[@id="res"]//div[@class="g"]');

        // Parse out each result
        foreach($results as $result)
        {
            $resultNode = $this->results->createElement('Result');

            $resultNode->appendChild( $this->parseTitle($result) );
            $resultNode->appendChild( $this->parseLink($result) );
            $resultNode->appendChild( $this->parseSummaryText($result, $xpath) );
            $resultNode->appendChild( $this->parseCacheLink($result, $xpath) );

            $this->results->getElementsByTagName('ResultSet')->item(0)->appendChild( $resultNode );
        }

        // Count the Result elements collected so far; counting 'ResultSet' would always give 1
        $this->resultsTally = $this->results->getElementsByTagName('Result')->length;
    }

    // -----------------
    // Parse Estimated Total Results
    // -----------------
    protected function parseEstimatedTotalResults(DOMXPath $xpath)
    {
        $estimatedTotalResults = $xpath->query('//table[@class="t bt"]//font//b[3]');

        return $estimatedTotalResults->item(0)->nodeValue;
    }

    // -----------------
    // Parse Title
    // -----------------
    protected function parseTitle(DOMNode $result)
    {
        $title = htmlentities($result->getElementsByTagName('h2')->item(0)->nodeValue);

        return new DOMElement('Title', $title);
    }

    // -----------------
    // Parse Link
    // -----------------
    protected function parseLink(DOMNode $result)
    {
        $url = htmlentities($result->getElementsByTagName('h2')->item(0)->getElementsByTagName('a')->item(0)->attributes->getNamedItem('href')->nodeValue);

        return new DOMElement('URL', $url);
    }

    // -----------------
    // Parse Summary
    // -----------------
    protected function parseSummaryText(DOMNode $result, DOMXPath $xpath)
    {
        $summary = $xpath->query('.//font[@size = "-1"]', $result);

        // Text of links and spans is stripped from the summary.
        // './/span' keeps the match inside this result; a bare '//span' would hit the whole page.
        $replaceArray = array();

        foreach($xpath->query('.//font[@size = "-1"]//a | .//span', $result) as $deletes)
        {
            $replaceArray[] = $deletes->nodeValue;
        }

        $summary = htmlentities(str_replace( $replaceArray, '', $summary->item(0)->nodeValue ));

        return new DOMElement('Summary', $summary);
    }

    // -----------------
    // Parse Cache Link
    // -----------------
    protected function parseCacheLink(DOMNode $result, DOMXPath $xpath)
    {
        $cacheLinkResults = $xpath->query('table//nobr/a[. = "Cached"]/@href', $result);

        // Not every result has a cached copy, so fall back to an empty string
        $cacheURL = $cacheLinkResults->length > 0 ? htmlentities($cacheLinkResults->item(0)->nodeValue) : '';

        return new DOMElement('CacheURL', $cacheURL);
    }

    // -----------------
    // Build Search String
    // -----------------
    protected function buildSearchString($query, $options)
    {
        $params = array();
        $params['q'] = (string) $query;

        foreach($options as $optionKey => $optionValue)
        {
            // Translate friendly name back to the real search parameter
            $translateKey = array_search($optionKey, $this->validOptions);
            $params[$translateKey] = $optionValue;
        }

        // URL encodes and glues together $params array
        return http_build_query($params);
    }

    // -----------------
    // Send Query
    // -----------------
    protected function sendSearch($queryString)
    {
        $url = $this->baseDomain . 'search?' . $queryString;

        // --------------------------!
        // INSERT WEB REQUEST TO $url HERE
        // $html is only a placeholder so the file parses:
        // assign the response body to it
        // --------------------------!
        $html = false;

        // --------------------------!
        // Check if the Web request was successful
        // --------------------------!
        if(!$html)
        {
            throw new Exception('Web request to ' . $url . ' failed.');
        }

        // --------------------------!
        // Return string containing HTML received
        // --------------------------!
        return $html;
    }

    // -----------------
    // Create Results Container
    // -----------------
    protected function createResultsContainer()
    {
        $dom = new DOMDocument('1.0', 'UTF-8');
        $dom->appendChild( $dom->createElement('EstimatedTotalResults') );
        $dom->appendChild( $dom->createElement('ResultSet') );

        return $dom;
    }

}
DangerMouse
« Reply #2 on: December 19, 2007, 09:59:38 AM »

Result Set Class, default name google_serp_resultset.php

Code:
<?php
/* -------------------------------------------------------------------------
 *
 * Google SERP Result Set
 *
 * - Purpose: Implements a Seekable Iterator to loop through
 * collected search results
 *
 * - Usage:  Construct takes DOMDocument returned from SERP Scrape. Each
 * result in this object can then be accessed via standard loop
 * patterns e.g. while, foreach.
 *
 * totalResults method returns number of results in object
 *
 * - Returns:  Google_Serp_Result Object when iterated
 *
 * - Author: gatecrasher1981@gmail.com
 *
 * Any feedback welcome.
 *
 * --------------------------------------------------------------------------*/
require_once('google_serp_result.php');

class Google_Serp_ResultSet implements SeekableIterator
{

    // ----------------
    // Define Properties
    // ----------------

    public $totalResultsAvailable;
    public $totalResultsReturned;

    protected $dom;
    protected $results;
    protected $currentIndex = 0;

    // ------------------
    // Public Methods
    // ------------------

    // --------------------
    // Parse the search response and retrieve the results for iteration
    // --------------------
    public function __construct(DOMDocument $dom)
    {
        $this->dom = $dom;

        $xpath = new DOMXPath($dom);

        $this->results = $xpath->query('//ResultSet//Result');

        $this->totalResultsAvailable = $dom->getElementsByTagName('EstimatedTotalResults')->item(0)->nodeValue;

        $this->totalResultsReturned = (int) $this->results->length;
    }

    // --------------------
    // Total Number of results returned
    // --------------------
    public function totalResults()
    {
        return $this->totalResultsReturned;
    }

    // --------------------
    // Implement SeekableIterator
    // --------------------

    // --------------------
    // Implement SeekableIterator::current()
    // --------------------
    public function current()
    {
        // Return an instance of result Object
        return new Google_Serp_Result($this->results->item($this->currentIndex), $this->currentIndex);
    }

    // --------------------
    // Implement SeekableIterator::key()
    // --------------------
    public function key()
    {
        return $this->currentIndex;
    }

    // --------------------
    // Implement SeekableIterator::next()
    // --------------------
    public function next()
    {
        $this->currentIndex += 1;
    }

    // --------------------
    // Implement SeekableIterator::rewind()
    // --------------------
    public function rewind()
    {
        $this->currentIndex = 0;
    }

    // --------------------
    // Implement SeekableIterator::seek()
    // --------------------
    public function seek($index)
    {
        $indexInt = (int) $index;

        if ($indexInt >= 0 && $indexInt < $this->results->length)
        {
            $this->currentIndex = $indexInt;
        }
        else
        {
            throw new OutOfBoundsException("Illegal index '$index'");
        }
    }

    // --------------------
    // Implement SeekableIterator::valid()
    // --------------------
    public function valid()
    {
        return $this->currentIndex < $this->results->length;
    }
}
DangerMouse
« Reply #3 on: December 19, 2007, 10:01:15 AM »

Result Object - default name google_serp_result.php

Code:
<?php
/* -------------------------------------------------------------------------
 *
 * Google SERP Result Object
 *
 * - Purpose: Models a result object.
 *
 * - Usage:  Takes DOMElement representing result on construct
 *
 *     Access properties to get result values
 *
 * - Returns:  N/A
 *
 * - Author: gatecrasher1981@gmail.com
 *
 * Any feedback welcome.
 *
 * --------------------------------------------------------------------------*/

class Google_Serp_Result
{

    // -----------------
    // Define Properties
    // -----------------

    public $position;
    public $type;
    public $title;
    public $summary;
    public $cacheUrl;
    public $url;

    protected $resultElement;

    // -----------------
    // Constructor
    // -----------------
    public function __construct(DOMElement $result, $position)
    {
        // Assign properties ($position is the zero-based index, so rank is one higher)
        $this->position = $position + 1;
        $this->title    = $result->getElementsByTagName('Title')->item(0)->nodeValue;
        $this->url      = $result->getElementsByTagName('URL')->item(0)->nodeValue;
        $this->summary  = $result->getElementsByTagName('Summary')->item(0)->nodeValue;
        $this->cacheUrl = $result->getElementsByTagName('CacheURL')->item(0)->nodeValue;

        $this->resultElement = $result;
    }
}
vsloathe
« Reply #4 on: December 19, 2007, 11:21:31 AM »

Awful lot of code there. ROFLMAO

Thanks for the contribution, I will have to check it out when I get time later.
perkiset
« Reply #5 on: December 19, 2007, 01:24:37 PM »

As will I DM - I have an itch to redo my stuff, so this will be cool to go over how you tackled it.

Thanks muchly
/p
m0nkeymafia
« Reply #6 on: January 21, 2008, 08:12:27 AM »

Cheers DM, not tested / ran it yet but will try next few days!