The Cache: Technology Expert's Forum
 
Author Topic: The World's Simplest Crawler  (Read 1643 times)
webinfoguy25
« Reply #15 on: October 21, 2009, 11:10:56 AM »

OK, probably not the simplest, but certainly not a complicated thing.

I have several WH sites that need a sitemap and searchability, but I am no longer in control of the content - the clients, or PinkHat, or any number of sources might contribute/change the content of a site.

So I have a cron job that runs nightly and executes a little single-site crawler, then places searchable content into a database as well as creating the content for a static sitemap. It's really, really simple, but might offer the fledgling spider author some ideas. Here is the most basic way of calling it. The print_r at the end is just a really simple way of seeing everything that the object contains when it's done. Note that I don't do any real checking of the site, nor is it capable of handling relative URLs (I don't use any for my sites).

$crawler = new simpleCrawler();
$crawler->domain = 'www.mydomain.com';
$crawler->crawl();
print_r($crawler);

Here is the code:
<?php

class simpleCrawler
{
	private $todo, $done, $pageBuff, $currentURL;
	public $domain, $site, $pages, $debug, $debugMax;

	// Merge the per-page link lists into two sorted, de-duplicated site-wide lists
	private function compileSite()
	{
		$iTemp = array();
		$eTemp = array();
		foreach($this->pages as $page)
		{
			foreach($page['internal'] as $url=>$dummy) $iTemp[$url] = true;
			foreach($page['external'] as $url=>$dummy) $eTemp[$url] = true;
		}

		$this->site['internal'] = array();
		$this->site['external'] = array();
		foreach($iTemp as $url=>$dummy) $this->site['internal'][] = $url;
		foreach($eTemp as $url=>$dummy) $this->site['external'][] = $url;

		sort($this->site['internal']);
		sort($this->site['external']);
	}

	private function debug($msg) { if ($this->debug) echo "$msg\n"; }

	// Build the display content and the searchable word list for the current page
	private function extractContent()
	{
		// I do this because I like seeing the searchable before the raw when I print_r
		$this->pages[$this->currentURL]['searchable'] = false;

		// Everything from <body onward, tags stripped and whitespace collapsed
		$ptr = (int) stripos($this->pageBuff, '<body');
		$buff = trim(strip_tags(substr($this->pageBuff, $ptr, strlen($this->pageBuff))));
		$buff = trim(str_ireplace(array('&nbsp;', chr(10), "\t"), ' ', $buff));
		while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);
		$this->pages[$this->currentURL]['content'] = $buff;

		// Searchable text: punctuation removed, short words and pure numbers dropped
		$buff = trim(str_replace(array('-', '_', '.', ':', '/', '\\', ','), ' ', $buff));
		$buff = preg_replace('/[^A-Z0-9 \r]/i', '', $buff);
		while (strpos($buff, '  ') !== false) $buff = str_replace('  ', ' ', $buff);
		$words = explode(' ', $buff);
		$outWords = array();
		foreach($words as $word)
		{
			if (strlen($word) < 4) continue;
			if (preg_match('/^[0-9]*$/', $word)) continue;
			$outWords[] = $word;
		}
		$this->pages[$this->currentURL]['searchable'] = implode(' ', $outWords);
	}

	private function extractTitle()
	{
		// Pages without a <title> get an empty title
		$found = preg_match('/<title>(.*)<\/title>/i', $this->pageBuff, $parts);
		$this->pages[$this->currentURL]['title'] = $found ? $parts[1] : '';
	}

	// Sort every link on the current page into internal or external, and queue new internal ones
	private function extractURLs()
	{
		$this->pages[$this->currentURL]['internal'] = array();
		$this->pages[$this->currentURL]['external'] = array();

		preg_match_all('/<a href="([^"]*)/i', $this->pageBuff, $thisArr);
		foreach($thisArr[1] as $url)
		{
			$url = trim($url);
			if ($url === '') continue;
			if ($url[0] == '/') $url = "http://{$this->domain}$url";

			// Skip https, query-string, mailto and PDF links entirely
			if (
				(preg_match('~^https://~i', $url)) or
				(strpos($url, '?')) or
				(preg_match('/^mailto/i', $url)) or
				(preg_match('/\.pdf$/i', $url))
				)
				continue;

			if (!preg_match("~http://{$this->domain}~i", $url))
			{
				$this->pages[$this->currentURL]['external'][$url] = true;
				$this->debug("DENY\t$url");
				continue;
			}

			$this->pages[$this->currentURL]['internal'][$url] = true;
			if (!isset($this->done[$url]))
			{
				if (!in_array($url, $this->todo))
				{
					$this->todo[] = $url;
					$this->debug("TODO\t$url");
				}
			} else $this->debug("DONE\t$url");
		}
	}

	public function crawl()
	{
		if (!$this->domain)
			throw new Exception('simpleCrawler: You cannot crawl without specifying a domain.');

		$this->pages = array();
		$this->site = array();
		$this->todo = array();
		$this->done = array();
		$this->todo[] = "http://{$this->domain}/";

		$counter = 0;
		while (count($this->todo))
		{
			$thisURL = $this->currentURL = array_shift($this->todo);
			$this->debug("\n\nCRAWL\t$thisURL");
			$this->done[$thisURL] = true;
			// A failed fetch returns false; treat it as an empty page
			$this->pageBuff = (string) file_get_contents($thisURL);
			$this->pages[$this->currentURL]['url'] = $thisURL;
			$this->extractTitle();
			$this->extractURLs();
			$this->extractContent();

			// In debug mode, stop once debugMax is exceeded
			if ($this->debug)
				if ($this->debugMax)
					if ($counter++ > $this->debugMax)
						break;
		}

		$this->compileSite();
	}
}

Enjoy!

<add>
Added so that VSloathe would be more impressed with the output:  ROFLMAO
After the crawl is done, you have $crawler->pages, which is an array of arrays containing the internal & external links on each page, as well as each page's title, display content and searchable content for the database. You also have $crawler->site['internal'] and $crawler->site['external'], which contain all the internal links and external links on the entire site, most helpful for building a sitemap.
</add>
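
As a rough illustration of the sitemap use mentioned above, here is a minimal sketch that turns a list of internal URLs into an XML sitemap. It assumes the sitemap is wanted in the sitemaps.org XML format (the post does not say which format is used); buildSitemapXml() and the sample URLs are hypothetical and not part of simpleCrawler.

```php
<?php
// Hypothetical helper, not part of simpleCrawler: build a minimal
// sitemaps.org-style XML document from a list of absolute URLs.
function buildSitemapXml(array $urls): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape &, <, > and quotes so the URL is safe inside XML
        $loc = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= "  <url><loc>$loc</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}

// Stand-in for $crawler->site['internal'] after a crawl (assumed sample data)
$internal = array(
    'http://www.mydomain.com/',
    'http://www.mydomain.com/about.html',
);
echo buildSitemapXml($internal);
```

In a nightly job, the same call could be fed $crawler->site['internal'] directly and the result written to a static file.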


How come you don't use relative URLs?
perkiset
« Reply #16 on: October 21, 2009, 11:24:58 AM »

Relative URLs are an interesting tool and can be a REAL pain for a crawler. Relative URLs are resolved against where you last were; i.e., if your last URL was "aDir/anotherDir/file.html" and your next URL is "../../aDifferentDir/newFile.html" ... how do you really know where you are in the directory structure? The problem is the URL right before my first example: how do we know where THAT was?

HTTP is not supposed to be a stateful protocol, but relative URLs make it behave like one. The browser needs to keep careful track of where it has been so that it can walk the chain of relativeness to work out where the current URL actually is. It's a big pain for spiders and a big pain for site developers... so I never use them. There is probably an SEO benefit to this as well, because the URLs are all very clean, but I can't say that for sure.

IMO every URL should be "/root/aDir/myFile.html" or, even more so, "http://www.mydomain.com/root/aDir/myFile.html".

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
webinfoguy25
« Reply #17 on: October 21, 2009, 11:31:34 AM »

Do you have documentation supporting your views on relative URLs? I would like to use it at work, since I always like to show documentation for the suggestions I make for a site.

Thanks,
perkiset
« Reply #18 on: October 21, 2009, 11:48:00 AM »

Hmm, documentation, no.

This is simply experience. Consider this stream of requested urls from a browser:

/index.html
deeper1/page1.html
deeper2/another.html
deeper3/yetanother.html
../../besideWhat/whereAmI.html
../adifferentDir/deeperDir/anotherFile.html
yetanother/file.html

Please give me the hard location of the file "yetanother/file.html" using ONLY the last two URLs. Can't be done, right? That's because you've been relative for quite a while in the URL request stream. Browsers and servers have to handle this: Apache handles it for you, but if you don't have a file in the right place you have trouble. Your local browser handles it for you because it has to in order to comply with the spec for relative URLs on a site.

There was (once upon a time) a blink tag in HTML. It was simply <blink>, and it was HORRIBLY misused. It was awful. Miserable. So they deprecated it because although it looked good on paper, in real usage it was a bastard. It's the same with relative URLs. They mimic the relative nature of a *nix directory structure (the original essence of the web) but in real practice, and taken to extremes like my example above, they are horrible.

So don't use them. All that said, if you make it tough on a spider to figure out where a page is, it'll just give up on you, or ask for the wrong thing, get a 404 and think that the page is bad. "Just say no."
nutballs
« Reply #19 on: October 21, 2009, 12:20:10 PM »

But don't you store the URL that you got THAT URL from?
If not, that's why it's tough.
I store URLs with references to the source URL they came from within the same table. Self-referential keys.
And it's very schizo to write this, sure:
select * from tableA inner join tableA on fk_ID=ID where fk_ID=$id
Though of course I am missing the AS parts. But still...

But of course, make it easier on spiders.

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset
« Reply #20 on: October 21, 2009, 12:44:33 PM »

Yes ... when I spider and am handling relative URLs, I un-relative them immediately, before I put them in the ToDo list. It's not that it can't be done, it's just a PIA. And yes, I was forgetting that the current URL is already expanded, so you can pick it up from the referrer; I'm exaggerating the problem.

But I still hate them :)
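
For readers wondering what "un-relativing" a link against the page it was found on might look like, here is a minimal sketch. It is not perkiset's code: resolveUrl() is a hypothetical helper, it assumes an http or https base URL, and it ignores query strings, fragments and several edge cases that the full reference-resolution rules of RFC 3986 cover.

```php
<?php
// Hypothetical helper, not part of simpleCrawler: resolve $href against
// $base, the absolute URL of the page the link was found on.
function resolveUrl(string $base, string $href): string
{
    // Already absolute (has a scheme such as http:, https: or mailto:)
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link: reuse the scheme of the base page
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    if ($href !== '' && $href[0] === '/') {
        $path = $href;                                   // root-relative
    } else {
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        // Keep the directory of the base page, drop its file name
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $href;
    }

    // Collapse "." and ".." segments
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '.') continue;
        if ($segment === '..') { array_pop($out); continue; }
        $out[] = $segment;
    }
    return $origin . '/' . ltrim(implode('/', $out), '/');
}

// The example from earlier in the thread, with an assumed domain
echo resolveUrl(
    'http://www.mydomain.com/aDir/anotherDir/file.html',
    '../../aDifferentDir/newFile.html'
), "\n";
```

The key design point is the one raised in the thread: the only state needed is the absolute URL of the referring page, so storing that alongside each queued link is enough to expand it before it goes on the ToDo list.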