perkiset

How To Use It

You'll need to create some MySQL stuff first - the SQL is in the following posts. Then you'll need to "seed" the crawler with the first domain and first page that you want to crawl. In this case, you'd want the domain to be <your domain> and the starting page to be '/'. The crawler will take it from there.

I run this from the command line. Reading through the code you'll see that it pops a couple of different characters up as it walks through its job. At the end, it will have created records in the crawl_pages table marking each page as found or not found, along with a compressed version of its content and the page that pointed to <this page>.

In this particular case, I keep the crawl_pages table TEXT indexed so that I can use the data for searches on my own sites.

Unfortunately, in my last playing around with the code I jumped pretty much into a PHP 5 mentality, so the crawler will no longer work in 4.anything.

CAVEAT: I have stripped some of the really proprietary stuff out, but there is still stuff here that will not pertain to you OR you might wonder WTF is THAT there for... in which case please post and I'll fill you in.

IMPORTANT: I am expecting that if you take this code, grow it, change it or enhance it you'll let me know and let it grow here. Although there is no license agreement on it, it should all be considered GPL.

/p

perkiset

This is the domains table. Put the domain you want to crawl in here. The id is autoincrement, so you just need to add the domain name (fully qualified, like www.this.com) - you'll be using the domain id <here> in the other tables.


-- phpMyAdmin SQL Dump
-- version 2.8.2.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Apr 26, 2007 at 05:30 PM
-- Server version: 5.0.24
-- PHP Version: 5.2.1
--
-- Database: `temp`
--

-- --------------------------------------------------------

--
-- Table structure for table `crawl_domains`
--

CREATE TABLE `crawl_domains` (
  `id` int(11) NOT NULL auto_increment,
  `domain` varchar(128) NOT NULL default '',
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=2 ;

--
-- Dumping data for table `crawl_domains`
--

INSERT INTO `crawl_domains` (`id`, `domain`) VALUES (1, 'www.myfirstdomain.com');

perkiset

This is the pages table - you will need to seed it with siteid 1 and the page '/' so the spider starts at your root page (a ready-to-run seed INSERT follows the dump below).


-- phpMyAdmin SQL Dump
-- version 2.8.2.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Apr 26, 2007 at 05:32 PM
-- Server version: 5.0.24
-- PHP Version: 5.2.1
--
-- Database: `temp`
--

-- --------------------------------------------------------

--
-- Table structure for table `crawl_pages`
--

CREATE TABLE `crawl_pages` (
  `id` int(11) NOT NULL auto_increment,
  `siteid` int(11) NOT NULL default '0',
  `url` varchar(128) NOT NULL default '',
  `referrer` int(11) NOT NULL,
  `crawlstate` tinyint(4) NOT NULL default '1',
  `crawlfound` tinyint(4) NOT NULL default '0',
  `lastcrawl` datetime NOT NULL default '0000-00-00 00:00:00',
  `pagetitle` varchar(254) NOT NULL default '',
  `avatar` varchar(254) NOT NULL default '',
  `searchblob` text NOT NULL,
  `lastping` datetime NOT NULL,
  `nextping` datetime NOT NULL default '1980-01-01 00:00:00',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `siteid` (`siteid`,`url`),
  KEY `crawlstate` (`crawlstate`,`url`),
  KEY `nextping` (`nextping`),
  FULLTEXT KEY `searchblob` (`pagetitle`,`searchblob`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4471 ;
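
The dump above only carries the table structure, so here is the seed row described earlier - assuming your domain went into crawl_domains as id 1, and using referrer 0 to mean "no referring page" (the other columns take their defaults):

INSERT INTO `crawl_pages` (`siteid`, `url`, `referrer`) VALUES (1, '/', 0);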

perkiset

This is the "spiderlets" table - it is for the dispatcher to watch who's done and can he fire up some more. It is also to pass information to the spiderlet so that <it> knows what to do.

Yes, I could've used all sorts of other ways... execution params, pipes... all kinds of shit. I did it this way because it was easy at the time and it still makes too much sense to fight out something more efficient.


-- phpMyAdmin SQL Dump
-- version 2.8.2.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Apr 26, 2007 at 05:35 PM
-- Server version: 5.0.24
-- PHP Version: 5.2.1
--
-- Database: `temp`
--

-- --------------------------------------------------------

--
-- Table structure for table `crawl_spiderlets`
--

CREATE TABLE `crawl_spiderlets` (
  `id` int(11) NOT NULL auto_increment,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=0 AUTO_INCREMENT=1 ;

perkiset

Unfortunately, I need to post several support files as well... here is an example of the "paths.inc" file referenced in the dispatcher - it basically points to all the essential places my classes and code reference:


<?php

$GLOBALS['rootPath'] = $rootPath = '/www/sites/temp/cc';
$GLOBALS['pagePath'] = $pagePath = '/www/sites/temp/cc/pages';
$GLOBALS['systemPath'] = $systemPath = '/www/sites/temp/cc/system';
$GLOBALS['libPath'] = $libPath = '/www/sites/lib';
$GLOBALS['classPath'] = $classPath = '/www/sites/lib/classes';
$GLOBALS['themePath'] = $themePath = '/www/sites/temp/cc/theme';
$GLOBALS['fontPath'] = $fontPath = '/www/sites/lib/classes/fonts';
$GLOBALS['galleryPath'] = "/www/sites/temp/storage/galleries";
$GLOBALS['transPath'] = "/www/sites/temp/storage";

?>

perkiset

This post is to stop you so that you go up to the code repository and grab class.dbconnection.php as well - you'll need it for the spider.

You will also need the webRequest class, located in the repository as well.

perkiset

This is the dispatcher. I call it crawler.php.

Note that this spider was NOT meant as a web crawler; it was meant as a site crawler. You can see a comment I made to myself shortly down the code where I say that it will need to be modified if I ever want to do more than one domain per crawl. I'll leave that up to you.
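
If you do head that direction, here's a rough, untested sketch of the idea (assuming dbConnection's fetchArray() returns false once the rows run out - check the class in the repository): pull all the domains up front, then resolve each page's Host from its siteid instead of hard-coding domain 1.


<?php

// Untested sketch: map domain ids to hostnames once, up front...
$domains = array();
$db->query("select id, domain from crawl_domains");
while ($db->fetchArray())
{
    $domains[$db->row['id']] = $db->row['domain'];
}

// ...then, when dispatching a page, look its host up by siteid:
// $db->query("select siteid from crawl_pages where id=$nextID");
// $db->fetchArray();
// $http->Host = $domains[$db->row['siteid']];

?>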

Note also that I grab a cookie called "sessionid" in the beginning - this is so that my own crawler doesn't beat the shit out of me like other bots... at least I should be able to see my own cookie and report back that <this bot> is a single user, not a user-per-page-pull.


#!/usr/local/bin/php
<?php

require_once('./paths.inc');
require_once("$classPath/class.webrequest.php");
require_once("$classPath/class.dbconnection.php");
$spiderletExec = "$systemPath/spiderlet.php";

$db = new dbConnection('127.0.0.1', 'auser', 'thepassword', 'thedatabase');
$GLOBALS['utilDB'] = &$db;

// THIS WILL HAVE TO BE UPDATED IF I EVER INTEND TO DO MORE THAN ONE DOMAIN PER SPIDER
$http = new webRequest();
$http->Host = $db->singleAnswer("select domain from crawl_domains where id='1'");
$http->URL = '/';
$http->Get();
$sessionID = $http->GetCookie('sessionid');
//print "Setting SessionID=$sessionID ";

// Set all pages current in the database to "unfound" and "need to be worked"
$db->query("update crawl_pages set crawlstate=1, crawlfound=0");

$allDone = false;
while (!$allDone)
{
    // Only go forward if there are 10 or less spiderlets running
    $spiderlets = $db->singleAnswer("select count('x') from crawl_spiderlets");
    if ($spiderlets >= 10)
    {
        print 'W';
        sleep(1);
        continue;
    }

    // There is a spiderlet slot open - go get a page
    $nextID = $db->singleAnswer("select id from crawl_pages where crawlstate=1 limit 1");

    if (($nextID <= ' ') && ($spiderlets == 0))
    {
        // No page - if there are no spiderlets either then I am done.
        $allDone = true;
        continue;
    }

    if ($nextID <= ' ')
    {
        print 'w';
        // No page, but there are still spiderlets working... give them a chance.
        sleep(1);
        continue;
    }

    // There is a page to do and a free spiderlet slot.
    print '.';
    $db->query("insert into crawl_spiderlets() values()");
    $spiderletID = $db->singleAnswer("select LAST_INSERT_ID()");
    $db->query("update crawl_pages set crawlstate=2 where id=$nextID");
    $execStr = "$spiderletExec $nextID $spiderletID $sessionID > /dev/null &";
    exec($execStr);

    // for testing...
    // $execStr = "$spiderletExec $nextID $spiderletID $sessionID 1";
    // print shell_exec($execStr);
}

print chr(10);

// Reset the ID auto-incrementer to 1...
$db->query("truncate table crawl_spiderlets");

?>

perkiset

Here's the juice: the actual spiderlet.

This script is executed by the crawler. If you read the crawler script you will see that I currently let 10 run at once. This is an arbitrary number.

There is a little chunk in the beginning of the code talking about "customWords" - this is a little file I use to "augment" certain words. For example, I have sites that deal with things that might be red, might be a bra, might be new, etc. - and unfortunately, MySQL by default does not TEXT index words that are only 3 chars long. You can modify this and recompile, but it puts a burden on the search engine and suddenly things don't work quite so nicely. So instead, I have a little file that looks like this:

<?php

// I will have been called where arrays customSearchWords and customReplaceWords already exist.
// I just need to replace words I want special-tagged here. Words that often need tagging are
// 3 letter words or incredibly common words.

$customSearchWords[] = 'aid';
$customReplaceWords[] = 'aidxx';

?>


... where 3 letter words are suffixed in such a way that they are no longer normal words, but I can still use them when a surfer types them into a search box. For example, someone types in "aid" and in the background I start searching for "aidxx."
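
The flip side happens at search time. The actual search page isn't part of this thread, so the following is only a sketch of the idea: run the surfer's query through the same substitutions before the FULLTEXT match, so "aid" quietly becomes "aidxx."


<?php

// Sketch only - the real search page is not posted in this thread.
$customSearchWords = array();
$customReplaceWords = array();
include('custom.searchWords.php');

// Apply the same augmentation to the surfer's query...
$terms = str_replace($customSearchWords, $customReplaceWords, $_GET['q']);
$terms = mysql_escape_string($terms);

// ...then hit the FULLTEXT index built on (pagetitle, searchblob).
$sql = "select id, url, pagetitle from crawl_pages " .
       "where match(pagetitle, searchblob) against('$terms')";

?>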

Another thing you'll notice about the code is that I exclude any file that doesn't look like it's going to return HTML to me... images, zips - you name it. Makes the handling of weirdness later easier.

<edit> Sorry - looks like the forum flubbed up the tabbing a bit... you'll need to clean yours manually</edit>


#!/usr/local/bin/php
<?php

//error_reporting(E_ALL);

require_once('./paths.inc');
require_once("$classPath/class.webrequest.php");
require_once("$classPath/class.dbconnection.php");
require_once("$rootPath/localvars.php");
$customWords = "$systemPath/custom.searchWords.php";

if (file_exists($customWords))
{
    $customSearchWords = array();
    $customReplaceWords = array();
    include($customWords);
    $GLOBALS['customsearch']['search'] = &$customSearchWords;
    $GLOBALS['customsearch']['replace'] = &$customReplaceWords;
}

$db = new dbConnection($db_host, $db_user, $db_password, $db_database);
$GLOBALS['utilDB'] = &$db;
$linkID = $_SERVER['argv'][1];
$spiderletID = $_SERVER['argv'][2];
$sessionID = $_SERVER['argv'][3];
$verbose = $_SERVER['argv'][4];
$quiet = ($verbose) ? false : true;

$spiderlet = new Spiderlet($linkID, $spiderletID, $quiet, $sessionID);
$spiderlet->Crawl();

class Spiderlet{

    var $db;
    var $http;
    var $title;
    var $pageAvatar;
    var $content;
    var $linkID;
    var $releaseID;
    var $badLinks = array();
    var $badChars = array();
    var $links = array();
    var $quiet;
    var $sessionID;

    function __construct($myLinkID, $myReleaseID, $qt=true, $sID='')
    {
        $this->quiet = $qt;
        $this->sessionID = $sID;

        $this->linkID = $myLinkID;
        $this->releaseID = $myReleaseID;

        $this->db = &$GLOBALS['utilDB'];
        $this->http = new WebRequest();
        $this->http->Port = 80;
        $this->http->endOnBody = true;
        $this->http->timeout = 10;
        $this->http->succeedOnTimeout = true;

        // Create arrays for later...
        $this->badLinks[] = 'file:';
        $this->badLinks[] = 'news:';
        $this->badLinks[] = 'ftp:';
        $this->badLinks[] = 'mailto:';
        $this->badLinks[] = 'telnet:';
        $this->badLinks[] = 'javascript:';
        $this->badLinks[] = 'https:';
        $this->badLinks[] = '.gif';
        $this->badLinks[] = '.jpg';
        $this->badLinks[] = '.png';
        $this->badLinks[] = '.pdf';
        $this->badLinks[] = '.tar';
        $this->badLinks[] = '.zip';
        $this->badLinks[] = '.rpm';
        $this->badLinks[] = '.mp3';
        $this->badLinks[] = '.aac';
        $this->badLinks[] = '.wmf';
        $this->badLinks[] = '.mov';
        $this->badLinks[] = '.com';
        $this->badLinks[] = '.deb';
        $this->badLinks[] = '.tgz';
        $this->badLinks[] = '.gz';
        $this->badLinks[] = '.rtf';
        $this->badLinks[] = '.doc';
        $this->badLinks[] = '.aiff';
        $this->badLinks[] = '.wav';
        $this->badLinks[] = '.tif';
        // Escape the regex metacharacters (mainly the leading dots) so that
        // '.com' only matches a literal ".com" in AcceptableLink() below.
        $this->badLinkStr = implode('|', array_map('preg_quote', $this->badLinks));

        $this->badChars[] = chr(13);
        $this->badChars[] = chr(9);
    }

    function __destruct() { $this->db->query("delete from crawl_spiderlets where id={$this->releaseID}"); }

    function AcceptableLink($linkStr) { return (!preg_match("/{$this->badLinkStr}/", $linkStr)); }

    function Crawl()
    {
        // Get the job from the database...
        $this->db->query("select crawl_pages.*, domain from crawl_pages, crawl_domains where crawl_pages.id={$this->linkID} and crawl_domains.id=crawl_pages.siteid");
        $this->db->fetchArray();
        $siteID = $this->db->row['siteid'];

        // Setup the requestor...
        $this->http->Host = strtolower($this->db->row['domain']);
        $this->http->URL = $this->db->row['url'];

        // It's possible that the crawler main put a sessionID into me that I need to pass
        // to the page I call as a cookie...
        if ($this->sessionID) { $this->http->SetCookie('sessionid', $this->sessionID); }

        // Get the page
        if (!$buff = $this->http->Get())
        {
            $this->db->query("update crawl_pages set crawlstate=-1 where id={$this->linkID}");
            exit;
        }
        $this->buffer = $this->http->Content();
        $DNI = preg_match('/rfspider: donotindex/', $this->buffer);

        // If it's an error, get out quick
        if (!(strpos($this->buffer, '404') === false))
        {
            $this->db->query("update crawl_pages set crawlstate=-1, pagetitle='', searchblob='{$this->buffer}' where id={$this->linkID}");
            exit;
        }

        // Distill the page into my content...
        $this->ExtractTitle();
        $this->ExtractPageAvatar();
        $this->GatherLinks();
        $this->FinalCleaning();

        // Update the database now...
        $now = date('Y-m-d H:i:s', time());
        $title = mysql_escape_string($this->title);
        $this->db->query("update crawl_pages set crawlstate=0, crawlfound=1, lastcrawl='$now', " .
            "pagetitle='$title', avatar='{$this->pageAvatar}', searchblob='{$this->content}' where id={$this->linkID}");

        // Now insert links. The table is UNIQUE indexed, so it will fail if the page already exists...
        for ($i=0; $i<count($this->links); $i++)
        {
            // Note that "to work" values of crawlstate and crawlfound are set by the DB...
            $newPage = $this->links[$i];
            if (($newPage == $this->http->URL) || ($newPage <= ' ')) { continue; }
            $this->db->query("insert into crawl_pages(siteid, url, referrer) values($siteID, '{$this->links[$i]}', {$this->linkID})", true);
        }

        // Now this looks rather bizarre, but if <this page> didn't want to
        // be in the index, eliminate it. It would, however, have contributed
        // to the links to-do list by now...
        if ($DNI) { $this->db->query("delete from crawl_pages where id={$this->linkID}"); }
    }

    function ExtractPageAvatar()
    {
        // The pageavatar is a graphic that can be used for search results.
        // If it's in the page, it'll be like this: <!-- pageavatar: /graphics/afile.jpg -->
        preg_match('/pageavatar:[ ]*([^ ]*)/i', $this->buffer, $matches);
        if ($matches[1]) { $this->pageAvatar = mysql_escape_string($matches[1]); }
    }

    function ExtractTitle()
    {
        $this->title = '[ No Page Title ]';
        preg_match('/<title>([^<]*)/i', $this->buffer, $matches);
        if ($matches[1]) { $this->title = $matches[1]; }

        $searchWords = &$GLOBALS['customsearch']['search'];
        if ($searchWords) { $this->title = str_replace($searchWords, $GLOBALS['customsearch']['replace'], $this->title); }
    }

    function FinalCleaning()
    {
        $regex = (strpos($this->buffer, '<!-- startcontent')) ? '/<!-- startcontent -->(.*)<!-- endcontent/ismU' : '/<body(.*)$/ismU';
        preg_match($regex, $this->buffer, $matches);

        $cleanArr = array();
        $cleanArr[] = '/(<script.*<\/script>)/imsU';
        $cleanArr[] = '/(<style.*<\/style>)/imsU';
        $cleanArr[] = '/(<!-- hide.*endhide -->)/imsU';
        $replArr = array(' ', ' ', ' ');
        $this->buffer = preg_replace($cleanArr, $replArr, $this->buffer);
        $this->buffer = strip_tags(str_replace($this->badChars, '', $this->buffer));

        $searchWords = &$GLOBALS['customsearch']['search'];
        if ($searchWords) { $this->buffer = str_replace($searchWords, $GLOBALS['customsearch']['replace'], $this->buffer); }

        $outArr = array();
        $inArr = explode(chr(10), $this->buffer);
        foreach($inArr as $line)
        {
            if ($line = trim($line)) { $outArr[] = $line; }
        }
        $this->content = implode(' ', $outArr);
        while (strpos($this->content, '  ') > 0) { $this->content = str_replace('  ', ' ', $this->content); }
        $this->content = mysql_escape_string($this->content);
    }

    function GatherLinks()
    {
        $ptr = 0;
        $rawBuff = $this->buffer;
        preg_match_all('/href="([^"]*)/ims', $rawBuff, $matches);
        foreach($matches[1] as $thisURL)
        {
            if (preg_match('/http:/', $thisURL))
            {
                // It MIGHT be an outbound...
                preg_match('#http://([^/]*)(.*)$#', $thisURL, $matches);
                $thisHost = $matches[1];
                $thisURL = $matches[2];
                if (strtolower($thisHost) == strtolower($this->http->Host)) {
                    if ($this->AcceptableLink($thisURL)) {
                        if (!in_array($thisURL, $this->links)) {
                            array_push($this->links, $thisURL);
                        }
                    }
                }
            } else {
                if ($this->AcceptableLink($thisURL)) {
                    if (!in_array($thisURL, $this->links)) {
                        array_push($this->links, $thisURL);
                    }
                }
            }
        }
    }
}

?>

Caligula

Awesome Perk! That must have taken forever! Great stuff bro.. thanks for sharing!  Applause

perkiset

Glad you like it Calig - lemme know how it does for you.

/p

KaptainKrayola

man perk your code is so neat and easy to follow.  9 thumbs up from the Kaptain

perkiset

Thanks Kapn - that's because I'm an idiot and know how quickly I forget.

*6 months pass*

"I never wrote that! What kinda asshole wrote that?!?! It needs to be completely redone."

Applause

So I try really hard to minimize self-inflicted Codezheimers.

/p

thedarkness

Quote from: KaptainKrayola

man perk your code is so neat and easy to follow.  9 thumbs up from the Kaptain


The Kaptain has nine thumbs? Some sort of serious interbreeding? Was your Dad related to your Mum before they got married? Small town Tortuga huh? Isolated?

Cheers,
td

thedarkness

Quote from: perkiset

"I never wrote that! What kinda asshole wrote that?!?! It needs to be completely redone."


Yeah, I often say that just b4 I realise I was the asshole (still am :-) )

td

P.S. Rockin' werk perk

[edit]Forgot to conjugate the verb "to go"[/edit]

KaptainKrayola

What makes you think all the thumbs belong to the Kaptain?  What kind of pirate doesn't have extra thumbs laying around just in case?

thedarkness

I always carry an extra one in my............. well........ never mind....

Caligula

oh thats just wrong....

Dbyt3r

Now, all you need is one gigantic AI script or regex script to identify every page as trackback | blog | guestbook etc and spam them accordingly ;)

