perkiset
« on: August 31, 2009, 12:43:13 PM »
OK, probably not the simplest, but certainly not a complicated thing. I have several WH sites that need a sitemap and searchability, but I am no longer in control of the content: the clients, or PinkHat, or any number of sources might contribute to or change the content of a site. So I have a cron job that runs nightly and executes a little single-site crawler, which places searchable content into a database and creates the content for a static sitemap. It's really, really simple, but might offer the fledgling spider author some ideas.

Here is the most basic way of calling it. The print_r at the end is just a quick way of seeing everything the object contains when it's done. Note that I don't do any real checking of the site, nor is the crawler capable of handling relative URLs (I don't use any for my sites).

    $crawler = new simpleCrawler();
    $crawler->domain = 'www.mydomain.com';
    $crawler->crawl();
    print_r($crawler);
Here is the code:

<?php
class simpleCrawler {
    private $todo, $done, $pageBuff, $currentURL;
    public $domain, $site, $pages, $debug, $debugMax;

    // Collapse the per-page link lists into sorted, de-duplicated site-wide lists
    private function compileSite()
    {
        $iTemp = array();
        $eTemp = array();
        foreach ($this->pages as $page)
        {
            foreach ($page['internal'] as $url => $dummy) $iTemp[$url] = true;
            foreach ($page['external'] as $url => $dummy) $eTemp[$url] = true;
        }
        $this->site['internal'] = array_keys($iTemp);
        $this->site['external'] = array_keys($eTemp);
        sort($this->site['internal']);
        sort($this->site['external']);
    }

    private function debug($msg) { if ($this->debug) echo "$msg\n"; }

    private function extractContent()
    {
        // I do this because I like seeing the searchable before the raw when I print_r
        $this->pages[$this->currentURL]['searchable'] = false;

        // Take everything from <body onward, strip the tags and collapse whitespace
        $ptr = stripos($this->pageBuff, '<body');
        $buff = trim(strip_tags(substr($this->pageBuff, (int) $ptr)));
        $buff = trim(str_ireplace(array('&nbsp;', chr(10), "\t"), ' ', $buff));
        while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);
        $this->pages[$this->currentURL]['content'] = $buff;

        // Searchable text: punctuation becomes spaces, then only letters, digits and spaces survive
        $buff = trim(str_replace(array('-', '_', '.', ':', '/', '\\', ','), ' ', $buff));
        $buff = preg_replace('/[^A-Z0-9 \r]/i', '', $buff);
        while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);

        // Drop short words and pure numbers
        $words = explode(' ', $buff);
        $outWords = array();
        foreach ($words as $word)
        {
            if (strlen($word) < 4) continue;
            if (preg_match('/^[0-9]*$/', $word)) continue;
            $outWords[] = $word;
        }
        $this->pages[$this->currentURL]['searchable'] = implode(' ', $outWords);
    }
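
    private function extractTitle()
    {
        // Non-greedy, with /s so a title broken across lines still matches
        $found = preg_match('/<title[^>]*>(.*?)<\/title>/is', $this->pageBuff, $parts);
        $this->pages[$this->currentURL]['title'] = $found ? trim($parts[1]) : '';
    }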

    private function extractURLs()
    {
        $this->pages[$this->currentURL]['internal'] = array();
        $this->pages[$this->currentURL]['external'] = array();
        preg_match_all('/<a href="([^"]*)/i', $this->pageBuff, $thisArr);
        foreach ($thisArr[1] as $url)
        {
            $url = trim($url);
            if ($url === '') continue;

            // Root-relative links get the domain prepended; other relative forms are not handled
            if ($url[0] == '/') $url = "http://{$this->domain}$url";

            // Skip https, query strings, mailto links and PDFs
            if (
                (preg_match('~^https://~i', $url)) or
                (strpos($url, '?') !== false) or
                (preg_match('/^mailto/i', $url)) or
                (preg_match('/\.pdf$/i', $url))
            )
                continue;

            // Anything not on this domain is recorded as external and not crawled
            if (!preg_match("~http://{$this->domain}~i", $url))
            {
                $this->pages[$this->currentURL]['external'][$url] = true;
                $this->debug("DENY $url");
                continue;
            }

            $this->pages[$this->currentURL]['internal'][$url] = true;
            if (empty($this->done[$url]))
            {
                if (!in_array($url, $this->todo))
                {
                    $this->todo[] = $url;
                    $this->debug("TODO $url");
                }
            } else $this->debug("DONE $url");
        }
    }

    public function crawl()
    {
        if (!$this->domain)
            throw new Exception('simpleCrawler: You cannot crawl without specifying a domain.');

        $this->pages = array();
        $this->site = array();
        $this->todo = array();
        $this->done = array();
        $this->todo[] = "http://{$this->domain}/";
        $counter = 0;

        // FIFO work list: extractURLs() appends new URLs to the end as pages are processed
        while (count($this->todo))
        {
            $thisURL = $this->currentURL = array_shift($this->todo);
            $this->debug("\n\nCRAWL $thisURL");
            $this->done[$thisURL] = true;
            $this->pageBuff = (string) file_get_contents($thisURL);
            $this->pages[$this->currentURL]['url'] = $thisURL;
            $this->extractTitle();
            $this->extractURLs();
            $this->extractContent();

            // When debugging, optionally stop once the page counter passes debugMax
            if ($this->debug)
                if ($this->debugMax)
                    if ($counter++ > $this->debugMax)
                        break;
        }
        $this->compileSite();
    }
}
Enjoy!

<add> Added so that VSloathe would be more impressed with the output: After the crawl is done, you have $crawler->pages, which is an array of arrays containing the internal and external links on each page, as well as each page's title, display content and searchable content for the database. You also have $crawler->site['internal'] and $crawler->site['external'], which contain all the internal and external links on the entire site, most helpful for building a sitemap. </add>
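As a rough illustration of the sitemap step, here is a minimal sketch that turns a list of internal URLs into sitemap XML. The function name buildSitemapXml is hypothetical and not part of the class above, and the XML sitemap format is an assumption, since the post only says "static sitemap" without naming a format.

```php
<?php
// Hypothetical helper: turn a list of URLs (such as $crawler->site['internal'])
// into sitemap XML. The name and output layout are illustrative only.
function buildSitemapXml(array $urls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape &, <, > and quotes so the URL is valid inside an XML element
        $loc  = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= "  <url><loc>{$loc}</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}

// Stand-in for $crawler->site['internal'] after a crawl
$internal = array('http://www.mydomain.com/', 'http://www.mydomain.com/about.html');
echo buildSitemapXml($internal);
```

In the nightly cron job the returned string could simply be written to a file with file_put_contents().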
« Last Edit: August 31, 2009, 02:43:47 PM by perkiset »

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.

perkiset
« Reply #1 on: August 31, 2009, 12:45:25 PM »
As an example of how to go forward, here is a small descendant class that adds a keyword cloud creator. The added function returns the top 100 words in a site, ranked 0..9 (0 being the highest, 9 being the lowest), so that you can then style them accordingly to create your keyword cloud.

class simpleCrawlerWithCloud extends simpleCrawler {
    public $excludes;
    public $numWords;

    function __construct()
    {
        $this->excludes = array(
            'about', 'above', 'added', 'after', 'again', 'asked', 'began', 'begin', 'below', 'chose', 'clear', 'close', 'comes',
            'could', 'doing', 'enter', 'every', 'extra', 'feels', 'final', 'first', 'found', 'front', 'given', 'gives', 'going',
            'great', 'happy', 'known', 'knows', 'label', 'large', 'lasts', 'later', 'leave', 'links', 'lived', 'lives', 'lower',
            'makes', 'means', 'meant', 'meets', 'moved', 'needs', 'never', 'newer', 'often', 'older', 'other', 'piece', 'place',
            'since', 'sizes', 'small', 'start', 'still', 'taken', 'takes', 'tells', 'thank', 'thats', 'their', 'there', 'these',
            'thing', 'those', 'under', 'until', 'upper', 'using', 'usual', 'wants', 'wasnt', 'where', 'which', 'while', 'whole',
            'whose', 'words', 'world', 'worst', 'worth', 'would');
        $this->numWords = 99; // actually, 100 items, zero indexed
    }

    public function &getWords()
    {
        // Build a hashed associative array of all the words that are
        // to be excluded, including anything that the user added.
        $excludes = array();
        foreach ($this->excludes as $word) $excludes[strtolower($word)] = true;

        // Count every remaining word of five or more letters across all pages
        $wordCount = array();
        foreach ($this->pages as $page)
        {
            $words = explode(' ', $page['searchable']);
            foreach ($words as $word)
            {
                $word = strtolower($word);
                if (strlen($word) < 5) continue;
                if (isset($excludes[$word])) continue;
                if (!isset($wordCount[$word])) $wordCount[$word] = 0;
                $wordCount[$word]++;
            }
        }

        // Keep the most frequent words (numWords is zero indexed, so 99 keeps 100)
        arsort($wordCount);
        $wordCount = array_slice($wordCount, 0, $this->numWords + 1, true);

        // Split the ranked list into ten equal blocks: 0 is the most frequent, 9 the least
        $perBlock = max(1, (int) ceil(count($wordCount) / 10));
        $finalWords = array();
        $position = 0;
        foreach ($wordCount as $word => $count)
            $finalWords[$word] = (int) floor($position++ / $perBlock);

        ksort($finalWords);
        return $finalWords;
    }
}
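To show what "style them accordingly" might look like, here is a small sketch that renders a word-to-rank array as HTML. The renderCloud function and the font sizes are assumptions made for illustration. It only relies on the array shape described above, with the word as key and a rank from 0 to 9 as value.

```php
<?php
// Hypothetical renderer: $ranked maps word => rank (0 = most frequent, 9 = least),
// the shape described for getWords(). The font sizes are arbitrary choices.
function renderCloud(array $ranked): string
{
    $html = '';
    foreach ($ranked as $word => $rank) {
        $size  = 28 - 2 * (int) $rank;   // rank 0 => 28px, rank 9 => 10px
        $html .= '<span style="font-size:' . $size . 'px">'
               . htmlspecialchars((string) $word) . "</span>\n";
    }
    return $html;
}

// Stand-in for the array returned by getWords()
echo renderCloud(array('crawler' => 0, 'sitemap' => 3, 'spider' => 9));
```

Because getWords() sorts the result by key, the cloud comes out in alphabetical order with the size carrying the frequency.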

vsloathe
« Reply #2 on: August 31, 2009, 01:44:44 PM »
If you made this recursive (like my directory crawler), it would take like 5 lines.

hai

perkiset
« Reply #3 on: August 31, 2009, 02:41:05 PM »
I don't think so. If you look it over a bit, I think you'll see that the part that would be replaced by recursion is only, like, 5 lines. The bulk of the code centers around distilling the content and extracting appropriate URLs to go after. Recursion is hot, and particularly good for a dir scraper. In this case, however, I like serial collection of URLs and a FIFO list of links to do: it's easier to get my arms around and to debug.

<edit> In fact, refactoring quickly in my head just to make sure I'm not being a tool, doing recursion correctly would add both complexity and lines, I think... </edit>
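The FIFO approach can be seen in isolation with a toy example. This is only a sketch: the $links array is a made-up in-memory "site" standing in for real pages, and crawlOrder is a hypothetical name, but the todo/done bookkeeping mirrors what crawl() and extractURLs() do.

```php
<?php
// Hypothetical in-memory "site": each page maps to the links found on it.
$links = array(
    '/'       => array('/a', '/b'),
    '/a'      => array('/b', '/a/deep'),
    '/b'      => array('/'),
    '/a/deep' => array(),
);

// Visit pages in FIFO order, never queueing a URL twice.
function crawlOrder(array $links, string $start): array
{
    $todo = array($start);
    $done = array();
    while (count($todo)) {
        $url = array_shift($todo);          // take from the front of the list
        $done[$url] = true;
        foreach ($links[$url] ?? array() as $next) {
            if (empty($done[$next]) && !in_array($next, $todo)) {
                $todo[] = $next;            // new work goes on the end
            }
        }
    }
    return array_keys($done);
}

print_r(crawlOrder($links, '/'));
```

At any point the whole state of the crawl is just the two arrays, which is what makes it easy to inspect while debugging.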
« Last Edit: August 31, 2009, 02:47:50 PM by perkiset »

kurdt
« Reply #4 on: August 31, 2009, 10:47:31 PM »
Shit, I should finally take some time to learn classes in PHP. It's just that I know myself well enough to know that when I do, I'll want to write my whole framework again with classes.

I met god and he had nothing to say to me.

deregular
« Reply #5 on: September 01, 2009, 04:07:30 AM »
Join the club, hence my previous questions about codeigniter.
Thanks for this perks, will take a look at it later when I get a chance.
Looks nice.

vsloathe
« Reply #6 on: September 01, 2009, 07:05:01 PM »
Yeah I'm sure it would, I was just giving you a hard time.

perkiset
« Reply #7 on: September 01, 2009, 07:07:29 PM »
Well, you made me look, Richard. But it was good: I thought I was an imbecile and had missed betterness for a moment.

kurdt
« Reply #8 on: October 14, 2009, 06:36:58 AM »
I modded this a bit. It was really annoying when it spidered those #comment-xxx links in blogs, so I did this:

    if (!preg_match("~http://{$this->domain}~i", $url) OR preg_match("~#~i", $url))

I also made it multithreaded with CURL, but I can't post it here because it's like a million classes with proxy throttling and other shit.
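One side effect of the condition above is that every URL containing a # lands in the external list. An alternative, sketched here under the assumption that page.html#comment-1 and page.html should count as the same page, is to cut the fragment off before the URL is classified. The stripFragment helper is hypothetical and not part of either poster's code.

```php
<?php
// Hypothetical helper: drop everything from the first '#' onward so that
// fragment links collapse onto the page they point into.
function stripFragment(string $url): string
{
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}

echo stripFragment('http://www.mydomain.com/post.html#comment-12'), "\n";
```

In extractURLs() this would be applied right after the trim(), with URLs that come back empty (a bare "#top") skipped.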

kurdt
« Reply #9 on: October 14, 2009, 10:31:30 AM »
Oh my fucking god relative urls are A BITCH.

perkiset
« Reply #10 on: October 14, 2009, 10:51:14 AM »
They really are. My full-fledged spider in the old archives handled them, but they suck mightily. I did a lot of work, trying to deal with them correctly.

kurdt
« Reply #11 on: October 14, 2009, 11:06:23 AM »
Quote from: perkiset
> They really are. My full-fledged spider in the old archives handled them, but they suck mightily. I did a lot of work, trying to deal with them correctly.

Yeah, I have been trying for the past hour to get them to work well, but no... Plus it doesn't make it any easier that I'm modifying a script written by somebody else. Even if it is great, it's still not your own logic, if you know what I mean. Also, spiders need to be controlled quite heavily, because you are unleashing an automated bot on somebody's site and it's not good if it goes wild.

nutballs
« Reply #12 on: October 14, 2009, 11:42:55 AM »
What's the difficulty with relative URLs? Am I missing something?
There are only 3 ways a URL can start. Root, which is a leading slash: OK, just jam the domain in front. Local, which is no leading slash: OK, just jam the whole current URL path in front. Absolute, which is the whole URL already, so the work is done.
What did I miss?
I mean, even a relative URL of ../something/../somethingelse/ just needs the current URL jammed in front of it. Which you have, because after all, how else did you request the page?
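The three cases can be sketched as follows. This is a simplified illustration, not a full RFC 3986 resolver: resolveHref is a hypothetical name, the page URL is assumed to be an absolute http or https URL, fragments on relative links are dropped, and the extra handling of "//host/..." links and of "." and ".." segments goes a little beyond the three cases in the post.

```php
<?php
// Hypothetical helper, not part of simpleCrawler: resolve $href against
// the URL of the page it was found on.
function resolveHref(string $pageUrl, string $href): string
{
    $href = trim($href);

    // Absolute: already has a scheme (http:, mailto:, ...), so the work is done
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) return $href;

    $base = parse_url($pageUrl);
    // Scheme-relative ("//host/path") only needs the current scheme
    if (strpos($href, '//') === 0) return $base['scheme'] . ':' . $href;

    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');
    $basePath = $base['path'] ?? '/';

    // Drop the fragment, then split the query string off the path
    $href  = explode('#', $href, 2)[0];
    $parts = explode('?', $href, 2);
    $path  = $parts[0];
    $query = isset($parts[1]) ? '?' . $parts[1] : '';

    if ($path === '') {
        // "", "#top" or "?page=2" all refer to the current page
        $path = $basePath;
        if ($query === '' && isset($base['query'])) $query = '?' . $base['query'];
    } elseif ($path[0] !== '/') {
        // Local: jam the directory of the current page in front
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }
    // Root: a leading slash only needs the domain, which is added at the end

    // Collapse "." and ".." segments without climbing above the root
    $out = array();
    $last = '';
    foreach (explode('/', $path) as $segment) {
        $last = $segment;
        if ($segment === '.') continue;
        if ($segment === '..') {
            if (count($out) > 1) array_pop($out);
            continue;
        }
        $out[] = $segment;
    }
    // A path ending in "." or ".." names a directory, so keep the trailing slash
    if ($last === '.' || $last === '..') $out[] = '';

    return $root . implode('/', $out) . $query;
}

echo resolveHref('http://www.mydomain.com/dir/sub/page.html', '../something/../somethingelse/'), "\n";
```

The resolver needs nothing beyond the URL of the page currently being parsed, which a one-page-at-a-time spider already has in hand.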

I could eat a bowl of Alphabet Soup and shit a better argument than that.

perkiset
« Reply #13 on: October 14, 2009, 11:55:35 AM »
Relative relates to the previous URL.
So if I add an href, "newPage.html", with no slash, then I look at the path of the LAST URL and that's where I start. Obviously, a new level of PIA, because there's new statefulness that needs to be added to the spider. Changes a spider that does one page at a time as well.

kurdt
« Reply #14 on: October 14, 2009, 12:10:56 PM »
Quote from: nutballs
> What's the difficulty with relative URLs? Am I missing something?
> There are only 3 ways a URL can start. Root, which is a leading slash: OK, just jam the domain in front. Local, which is no leading slash: OK, just jam the whole current URL path in front. Absolute, which is the whole URL already, so the work is done.
> What did I miss?
> I mean, even a relative URL of ../something/../somethingelse/ just needs the current URL jammed in front of it. Which you have, because after all, how else did you request the page?

Yeah, but there are a few other things to think about too. One problem is how deep you want to spider. If you start at www.domain.com/dir/, do you allow the spider to get stuff from the root? That's yet another thing you have to code. If you think it's simple, by all means do write us a piece of code that converts any given href value to an absolute URL. Of course you have the base URL at your disposal.
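The question of whether a spider started at www.domain.com/dir/ may wander up to the root is a policy decision, but one possible policy is short to express. The sketch below assumes the strictest answer: same host, and the path must sit under the starting directory. The inCrawlScope name is hypothetical, and both arguments are assumed to be absolute URLs.

```php
<?php
// Hypothetical scope check: only URLs on the same host and underneath the
// directory of the start URL are considered crawlable.
function inCrawlScope(string $startUrl, string $candidate): bool
{
    $start = parse_url($startUrl);
    $cand  = parse_url($candidate);
    if (!isset($start['host'], $cand['host'])) return false;
    if (strcasecmp($start['host'], $cand['host']) !== 0) return false;

    // Directory of the start URL, e.g. "/dir/" for both /dir/ and /dir/index.html
    $startPath = $start['path'] ?? '/';
    $startDir  = substr($startPath, 0, strrpos($startPath, '/') + 1);
    $candPath  = $cand['path'] ?? '/';

    return strncmp($candPath, $startDir, strlen($startDir)) === 0;
}

var_dump(inCrawlScope('http://www.domain.com/dir/', 'http://www.domain.com/dir/page.html'));
var_dump(inCrawlScope('http://www.domain.com/dir/', 'http://www.domain.com/index.html'));
```

A looser policy, such as allowing one level up or allowing the whole host, would only change the final comparison.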