The Cache: Technology Expert's Forum
 
Author Topic: The World's Simplest Crawler  (Read 1547 times)
perkiset
Olde World Hacker
Administrator
Lifer
*****
Online Online

Posts: 9792



View Profile
« on: August 31, 2009, 12:43:13 PM »

OK, probably not the simplest, but certainly not a complicated thing.

I have several WH sites that need a sitemap and searchability, but I am no longer in control of the content - the clients, or PinkHat, or any number of sources might contribute/change the content of a site.

So I have a cron job that runs nightly and executes a little single-site crawler, then places searchable content into a database as well as creating the content for a static sitemap. It's really, really simple, but might offer the fledgling spider author some ideas. Here is the most basic way of calling it. The print_r at the end is just a really simple way of seeing everything that the object contains when it's done. Note that I don't do any real checking of the site, nor is it capable of handling relative URLs (I don't use any for my sites).

$crawler = new simpleCrawler();
$crawler->domain = 'www.mydomain.com';
$crawler->crawl();
print_r($crawler);

Here is the code:
<?php

class simpleCrawler
{
    private $todo, $done, $pageBuff, $currentURL;
    public $domain, $site, $pages, $debug, $debugMax;

    // Merge every page's link lists into site-wide, de-duplicated, sorted lists.
    private function compileSite()
    {
        $iTemp = array();
        $eTemp = array();
        foreach($this->pages as $page)
        {
            foreach($page['internal'] as $url=>$dummy) $iTemp[$url] = true;
            foreach($page['external'] as $url=>$dummy) $eTemp[$url] = true;
        }

        $this->site['internal'] = array();
        $this->site['external'] = array();
        foreach($iTemp as $url=>$dummy) $this->site['internal'][] = $url;
        foreach($eTemp as $url=>$dummy) $this->site['external'][] = $url;

        sort($this->site['internal']);
        sort($this->site['external']);
    }

    private function debug($msg) { if ($this->debug) echo "$msg\n"; }

    // Store the display text ('content') and the search words ('searchable') for the current page.
    private function extractContent()
    {
        // I do this because I like seeing the searchable before the raw when I print_r
        $this->pages[$this->currentURL]['searchable'] = false;

        // Start at <body> if there is one, otherwise at the top of the page
        $ptr = (int) stripos($this->pageBuff, '<body');
        $buff = trim(strip_tags(substr($this->pageBuff, $ptr, strlen($this->pageBuff))));
        $buff = trim(str_ireplace(array('&nbsp;', chr(10), "\t"), ' ', $buff));
        while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);
        $this->pages[$this->currentURL]['content'] = $buff;

        $buff = trim(str_replace(array('-', '_', '.', ':', '/', '\\', ','), ' ', $buff));
        $buff = preg_replace('/[^A-Z0-9 \r]/i', '', $buff);
        while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);
        $words = explode(' ', $buff);
        $outWords = array();
        foreach($words as $word)
        {
            if (strlen($word) < 4) continue;                // too short to be worth searching
            if (preg_match('/^[0-9]*$/', $word)) continue;  // nothing but digits
            $outWords[] = $word;
        }
        $this->pages[$this->currentURL]['searchable'] = implode(' ', $outWords);
    }

    private function extractTitle()
    {
        preg_match('/<title>(.*)<\/title>/i', $this->pageBuff, $parts);
        // A page without a <title> gets an empty title instead of a notice
        $this->pages[$this->currentURL]['title'] = isset($parts[1]) ? $parts[1] : '';
    }

    private function extractURLs()
    {
        $this->pages[$this->currentURL]['internal'] = array();
        $this->pages[$this->currentURL]['external'] = array();

        preg_match_all('/<a href="([^"]*)/i', $this->pageBuff, $thisArr);
        foreach($thisArr[1] as $url)
        {
            $url = trim($url);
            if ($url === '') continue;   // empty href, nothing to follow
            if ($url[0] == '/') $url = "http://{$this->domain}$url";

            // Skip secure pages, anything with a query string, mailto links and PDFs
            if (
                (preg_match('~^https://~i', $url)) or
                (strpos($url, '?')) or
                (preg_match('/^mailto/i', $url)) or
                (preg_match('/\.pdf$/i', $url))
                )
                continue;

            if (!preg_match("~http://{$this->domain}~i", $url))
            {
                $this->pages[$this->currentURL]['external'][$url] = true;
                $this->debug("DENY\t$url");
                continue;
            }

            $this->pages[$this->currentURL]['internal'][$url] = true;
            if (!isset($this->done[$url]))
            {
                if (!in_array($url, $this->todo))
                {
                    $this->todo[] = $url;
                    $this->debug("TODO\t$url");
                }
            } else $this->debug("DONE\t$url");
        }
    }

    public function crawl()
    {
        if (!$this->domain)
            throw new Exception('simpleCrawler: You cannot crawl without specifying a domain.');

        $this->pages = array();
        $this->site = array();
        $this->todo = array();
        $this->done = array();
        $this->todo[] = "http://{$this->domain}/";

        $counter = 0;
        while (count($this->todo))
        {
            $thisURL = $this->currentURL = array_shift($this->todo);
            $this->debug("\n\nCRAWL\t$thisURL");
            $this->done[$thisURL] = true;
            $this->pageBuff = file_get_contents($thisURL);
            if ($this->pageBuff === false)
            {
                // Could not fetch the page, so there is nothing to extract
                $this->debug("FAIL\t$thisURL");
                continue;
            }
            $this->pages[$this->currentURL]['url'] = $thisURL;
            $this->extractTitle();
            $this->extractURLs();
            $this->extractContent();

            // When debugging, optionally stop after debugMax pages
            if ($this->debug && $this->debugMax && $counter++ > $this->debugMax)
                break;
        }

        $this->compileSite();
    }

}

Enjoy!

<add>
Added so that VSloathe would be more impressed with the output:  ROFLMAO
After the crawl is done, you have $crawler->pages, which is an array of arrays containing the internal & external links on each page, as well as each page's title, display content and searchable content for the database. You also have $crawler->site['internal'] and $crawler->site['external'], which contain all the internal links and external links on the entire site, most helpful for building a sitemap.
</add>
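
For the sitemap part, here is a minimal sketch of how the internal list could be turned into sitemap XML. The function name buildSitemapXml is made up for illustration and is not part of the class above, and the output assumes the plain sitemaps.org urlset format with nothing but a loc entry per URL.

```php
<?php
// Hypothetical helper (not part of simpleCrawler): build a sitemaps.org-style
// XML document from a flat list of absolute URLs such as $crawler->site['internal'].
function buildSitemapXml(array $urls)
{
    // The closing "?" and ">" are split only to keep editors and highlighters happy
    $xml = '<?xml version="1.0" encoding="UTF-8"?' . '>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url)
    {
        // Escape &, <, > and quotes so the URL is valid inside an XML element
        $xml .= '  <url><loc>' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}
```

After a crawl, something like file_put_contents('sitemap.xml', buildSitemapXml($crawler->site['internal'])); would write the static file.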
« Last Edit: August 31, 2009, 02:43:47 PM by perkiset » Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
perkiset
« Reply #1 on: August 31, 2009, 12:45:25 PM »

As an example of how to go forward, here is a small descendant class that has a cloud keyword creator. The added function will return the top 100 words in a site, ranked 0..9 (0 being the highest, 9 being the lowest) so that you can then style them accordingly to create your keyword cloud.

class simpleCrawlerWithCloud extends simpleCrawler
{
    public $excludes;
    public $numWords;

    function __construct()
    {
        $this->excludes = array(
            'about', 'above', 'added', 'after', 'again', 'asked', 'began', 'begin', 'below', 'chose', 'clear', 'close', 'comes',
            'could', 'doing', 'enter', 'every', 'extra', 'feels', 'final', 'first', 'found', 'front', 'given', 'gives', 'going',
            'great', 'happy', 'known', 'knows', 'label', 'large', 'lasts', 'later', 'leave', 'links', 'lived', 'lives', 'lower',
            'makes', 'means', 'meant', 'meets', 'moved', 'needs', 'never', 'newer', 'often', 'older', 'other', 'piece', 'place',
            'since', 'sizes', 'small', 'start', 'still', 'taken', 'takes', 'tells', 'thank', 'thats', 'their', 'there', 'these',
            'thing', 'those', 'under', 'until', 'upper', 'using', 'usual', 'wants', 'wasnt', 'where', 'which', 'while', 'whole',
            'whose', 'words', 'world', 'worst', 'worth', 'would');

        // array_splice() below keeps this many entries; the original value of 99
        // left 99 words rather than the intended 100
        $this->numWords = 100;
    }

    public function &getWords()
    {
        // This will build a hashed associative array of all the words
        // that are to be excluded, including anything that the user added.
        $excludes = array();
        foreach($this->excludes as $word) $excludes[strtolower($word)] = true;

        $wordCount = array();
        foreach($this->pages as $page)
        {
            $words = explode(' ', $page['searchable']);
            foreach($words as $word)
            {
                // Lowercase first so that "About" is excluded just like "about"
                $word = strtolower($word);
                if (strlen($word) < 5) continue;
                if (isset($excludes[$word])) continue;
                if (!isset($wordCount[$word])) $wordCount[$word] = 0;
                $wordCount[$word]++;
            }
        }

        // Most frequent first, then keep only the top numWords entries
        arsort($wordCount);
        array_splice($wordCount, $this->numWords);

        // Rank in blocks of 10: the first 10 words get rank 0, the next 10 rank 1, and so on
        $finalWords = array();
        $position = 0;
        foreach($wordCount as $word=>$count)
            $finalWords[$word] = (int) floor($position++ / 10);

        ksort($finalWords);
        return $finalWords;
    }
}
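
To show what "style them accordingly" could look like, here is a rough sketch that maps the ranks from getWords() onto font sizes. The function name cloudHtml and the pixel sizes are assumptions picked for the example, nothing more.

```php
<?php
// Hypothetical helper: turn a word => rank map (rank 0 = most frequent,
// 9 = least frequent) into HTML spans whose font size shrinks as the rank grows.
function cloudHtml(array $rankedWords, $maxPx = 28, $stepPx = 2)
{
    $spans = array();
    foreach ($rankedWords as $word => $rank)
    {
        // With the defaults, rank 0 is 28px and rank 9 is 10px
        $size = $maxPx - ($rank * $stepPx);
        $spans[] = '<span style="font-size:' . $size . 'px">'
                 . htmlspecialchars((string) $word, ENT_QUOTES, 'UTF-8') . '</span>';
    }
    return implode(' ', $spans);
}
```

Usage would be along the lines of echo cloudHtml($crawler->getWords()); once the crawl has finished.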
Logged

vsloathe
vim ftw!
Global Moderator
Lifer
*****
Offline Offline

Posts: 1669



View Profile
« Reply #2 on: August 31, 2009, 01:44:44 PM »

If you made this recursive (like my directory crawler  Smooch ), it would take like 5 lines.
Logged

hai
perkiset
« Reply #3 on: August 31, 2009, 02:41:05 PM »

I don't think so  Smooch

If you look it over a bit, I think you'll see that the part that would be replaced by recursion is only, like, 5 lines. The bulk of the code centers around distilling the content and extracting appropriate URLs to go after.

Recursion is hot, and particularly good for a Dir scraper. In this case however, I like serial collection of URLs and a FIFO list of links todo - it's easier to get my arms around and to debug.

<edit> ROFLMAO In fact, refactoring quickly in my head just to make sure I'm not being a tool, doing recursion correctly would add both complexity and lines I think...</edit>
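
To compare the two shapes without touching the network, here is a sketch that walks a made-up link map both ways. The names crawlQueue and crawlRecursive and the $links array are hypothetical stand-ins, not code from the crawler above.

```php
<?php
// $links is a made-up map standing in for real pages: URL => links found on that page.

// Serial version: a FIFO todo list plus a done hash, the same idea as crawl().
function crawlQueue(array $links, $start)
{
    $todo = array($start);
    $done = array();
    while (count($todo))
    {
        $url = array_shift($todo);
        $done[$url] = true;
        $found = isset($links[$url]) ? $links[$url] : array();
        foreach ($found as $next)
            if (!isset($done[$next]) && !in_array($next, $todo))
                $todo[] = $next;
    }
    return array_keys($done);
}

// Recursive version: the done hash has to be passed along by reference,
// otherwise pages that link back to each other would recurse forever.
function crawlRecursive(array $links, $url, array &$done)
{
    $done[$url] = true;
    $found = isset($links[$url]) ? $links[$url] : array();
    foreach ($found as $next)
        if (!isset($done[$next]))
            crawlRecursive($links, $next, $done);
}
```

Both visit every page once. The queue goes breadth first and the recursion goes depth first, so the visiting order differs, and the recursive one still needs the same bookkeeping.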
« Last Edit: August 31, 2009, 02:47:50 PM by perkiset » Logged

kurdt
Lifer
*****
Offline Offline

Posts: 1153


paha arkkitehti


View Profile
« Reply #4 on: August 31, 2009, 10:47:31 PM »

Shit, I should finally take some time to learn classes in PHP. It's just that I know myself well enough to know that when I do, I'll want to write my whole framework again with classes ;)
Logged

I met god and he had nothing to say to me.
deregular
Expert
****
Offline Offline

Posts: 172


View Profile
« Reply #5 on: September 01, 2009, 04:07:30 AM »

Join the club, hence my previous questions about codeigniter.

Thanks for this perks, will take a look at it later when I get a chance.

Looks nice.
Logged
vsloathe
« Reply #6 on: September 01, 2009, 07:05:01 PM »

Yeah I'm sure it would, I was just giving you a hard time.
Logged

perkiset
« Reply #7 on: September 01, 2009, 07:07:29 PM »

 ROFLMAO well you made me look, Richard.
But it was good - thought I was an imbecile and had missed betterness for a moment.

Logged

kurdt
« Reply #8 on: October 14, 2009, 06:36:58 AM »

I modded this a bit. It was really annoying when it spidered those #comment-xxx links in blogs so I did this:
if (!preg_match("~http://{$this->domain}~i", $url) OR preg_match("~#~i",$url))

I also made it multithreaded with CURL but can't post it here because it's like a million classes with proxy throttling and other shit.
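
A possible alternative to the check above (a sketch, not what was actually used) is to strip the fragment instead of denying the link, so the page itself still gets crawled once. The function name stripFragment is made up for illustration.

```php
<?php
// Hypothetical helper: drop the "#fragment" part of a URL so that
// /post.html#comment-12 and /post.html are treated as the same page.
function stripFragment($url)
{
    $pos = strpos($url, '#');
    return ($pos === false) ? $url : substr($url, 0, $pos);
}
```

It would be called on each href in extractURLs() before the other checks. A pure in-page anchor such as #top comes back as an empty string and should simply be skipped.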
Logged

kurdt
« Reply #9 on: October 14, 2009, 10:31:30 AM »

Oh my fucking god relative urls are A BITCH.
Logged

perkiset
« Reply #10 on: October 14, 2009, 10:51:14 AM »

They really are. My full-fledged spider in the old archives handled them, but they suck mightily. I did a lot of work, trying to deal with them correctly.
Logged

kurdt
« Reply #11 on: October 14, 2009, 11:06:23 AM »

Quote from: perkiset
They really are. My full-fledged spider in the old archives handled them, but they suck mightily. I did a lot of work, trying to deal with them correctly.

Yeah, I have been trying for the past hour to get them to work well but no... Plus it doesn't make it any easier that I'm modifying a script written by somebody else :) Even if it is great, it's still not your own logic, if you know what I mean :)

Also, spiders are something that needs to be controlled quite heavily, because you are unleashing an automated bot on somebody's site and it's not good if it goes wild.
Logged

nutballs
Administrator
Lifer
*****
Online Online

Posts: 5604


Back in my day we had 9 planets


View Profile
« Reply #12 on: October 14, 2009, 11:42:55 AM »

What's the difficulty with relative URLs? Am I missing something?

There are only 3 ways a URL can start:
root, which is a leading slash. OK, just jam the domain in front.
local, which is no leading slash. OK, just jam the whole current URL path in front.
absolute, which is the whole URL already, so the work is done.

What did I miss?

I mean, even a relative of ../something/../somethingelse/
just needs the current URL jammed in front of it. Which you have, because after all, how else did you request the page?
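
The three cases above can be sketched roughly like this. The function name resolveUrl is hypothetical, and it assumes http/https base URLs without a port, plus hrefs without query strings or fragments (which the crawler skips anyway), so it is nowhere near a full RFC 3986 resolver. The "." and ".." segments are collapsed so that the same page does not end up in the done list under several spellings.

```php
<?php
// Hypothetical helper: resolve an href against the URL of the page it was found on.
function resolveUrl($href, $base)
{
    // Absolute already, so the work is done
    if (preg_match('~^https?://~i', $href)) return $href;

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    $basePath = isset($parts['path']) ? $parts['path'] : '/';

    if ($href !== '' && $href[0] == '/')
        $path = $href;   // root: just put the domain in front
    else
        // local: start from the directory of the current page
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;

    // Collapse "." and ".." segments, never climbing above the root
    $segs = explode('/', $path);
    $out = array();
    foreach ($segs as $seg)
    {
        if ($seg === '..') { if (count($out) > 1) array_pop($out); }
        elseif ($seg !== '.') $out[] = $seg;
    }
    // A path ending in "." or ".." names a directory, so keep the trailing slash
    $last = end($segs);
    if ($last === '.' || $last === '..') $out[] = '';

    return $root . implode('/', $out);
}
```

For example, resolveUrl('newPage.html', 'http://www.mydomain.com/dir/page.html') gives http://www.mydomain.com/dir/newPage.html.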
Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset
« Reply #13 on: October 14, 2009, 11:55:35 AM »

Relative relates to the previous URL.

So if I add an href, "newPage.html" with no slash, then I look at the path of the LAST URL (the page the link was found on) and that's where I start. Obviously, a new level of PIA because there's new statefulness that needs to be added to the spider. Changes a spider that does one page at a time as well.
Logged

kurdt
« Reply #14 on: October 14, 2009, 12:10:56 PM »

Quote from: nutballs
What's the difficulty with relative URLs? Am I missing something?

There are only 3 ways a URL can start:
root, which is a leading slash. OK, just jam the domain in front.
local, which is no leading slash. OK, just jam the whole current URL path in front.
absolute, which is the whole URL already, so the work is done.

What did I miss?

I mean, even a relative of ../something/../somethingelse/
just needs the current URL jammed in front of it. Which you have, because after all, how else did you request the page?

Yeah, but there are a few other things to think about too. One problem is how deep you want to spider. If you start at www.domain.com/dir/, do you allow the spider to get stuff from root? That's yet another thing you have to code. If you think it's simple, by all means do write us a piece of code that converts any given href value to an absolute URL. Of course you have the base URL at your disposal :)
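
The "do you allow the spider to get stuff from root" question could be handled with a small scope check like the sketch below. The function name inScope is hypothetical, and it assumes both URLs are already absolute, free of "../" segments, and that the start URL has at least a "/" after the host.

```php
<?php
// Hypothetical helper: is $url at or below the directory the crawl started in?
function inScope($url, $startUrl)
{
    // Reduce the start URL to its directory,
    // e.g. http://www.domain.com/dir/page.html => http://www.domain.com/dir/
    $prefix = substr($startUrl, 0, strrpos($startUrl, '/') + 1);
    // In scope only if the URL begins with that directory prefix (case-sensitive)
    return strncmp($url, $prefix, strlen($prefix)) === 0;
}
```

With a start of http://www.domain.com/dir/, a link to http://www.domain.com/index.html is out of scope and would not be added to the todo list.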
Logged
