The Cache: Technology Expert's Forum
 
Topic: Perk's Spider Source
kurdt
« Reply #30 on: August 10, 2008, 03:32:43 AM »

Hmm, this works great, but for some reason it seems to get stuck sometimes, just printing "w" a few million times until I manually stop it. Any idea if you could implement a simple failsafe that aborts the thread after 50 failed tries or something? Or maybe I'll do it myself when I have the time.
kurdt
« Reply #31 on: August 10, 2008, 08:35:05 AM »

Yet another question: is this supposed to check every page in crawl_pages again when you run crawler.php? Or do I need to manually delete some rows to prevent it from spidering already spidered domains and pages?
perkiset
« Reply #32 on: August 10, 2008, 11:22:34 AM »

As I recall, this code grabs URLs and checks them into the database... if they're already there, it does not add them again. As spiderlets are dispatched, they handle exactly one URL from the database. As it is handled, it is flagged so that it does not need to be done again. This is how it makes sure it gets a whole site, but doesn't loop around and keep doing the same URLs over and over again.
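A minimal sketch of that check-in and flagging scheme, written in Python with SQLite rather than the spider's own PHP and MySQL. The table name crawl_pages comes from the question above; the url and done columns and all function names are assumptions made for illustration, not the actual schema.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open a database holding a hypothetical crawl_pages table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS crawl_pages ("
        " url TEXT PRIMARY KEY,"            # the key rejects duplicate URLs
        " done INTEGER NOT NULL DEFAULT 0)"  # 0 = still to spider, 1 = handled
    )
    return db

def add_url(db, url):
    """Check a URL in. Returns True if it was new, False if already stored."""
    cur = db.execute("INSERT OR IGNORE INTO crawl_pages (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1

def next_url(db):
    """Return one URL that has not been handled yet, or None if none remain."""
    row = db.execute(
        "SELECT url FROM crawl_pages WHERE done = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None

def mark_done(db, url):
    """Flag a URL as handled so it is not handed out again."""
    db.execute("UPDATE crawl_pages SET done = 1 WHERE url = ?", (url,))
    db.commit()
```

Under this scheme, rerunning the crawler would only pick up rows whose flag is still unset, so already spidered pages would not need to be deleted by hand.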

I don't know why it would be showing a "w" and then just getting hung... this means that the dispatcher is waiting for there to be at least one free spiderlet slot for it to fire off another one. It is unlikely that the spiderlets are getting hung and staying alive - it is more likely that they are dying out on an error and not updating the database to let the dispatcher know that there is a free slot.
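The failsafe asked about earlier in the thread could look roughly like the sketch below: a wait loop that gives up after a fixed number of tries instead of printing "w" forever. It is Python rather than the spider's PHP, and has_free_slot, DispatcherStalled and the parameter names are all hypothetical; only the limit of 50 tries is taken from the question.

```python
import time

class DispatcherStalled(Exception):
    """Raised when no spiderlet slot frees up within the allowed tries."""

def wait_for_free_slot(has_free_slot, max_tries=50, delay=1.0, sleep=time.sleep):
    """Poll has_free_slot() until it returns True or max_tries is used up.

    Returns the number of the try that succeeded. has_free_slot is any
    callable supplied by the caller, for example a database lookup.
    """
    for attempt in range(1, max_tries + 1):
        if has_free_slot():
            return attempt
        print("w", end="", flush=True)  # same "waiting" marker the dispatcher prints
        sleep(delay)
    raise DispatcherStalled(f"no free slot after {max_tries} tries")
```

Raising an error instead of looping gives the caller a chance to log the stall or reset the slots before trying again.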

Run ps aux and see if there are any spiderlets running. Then check the spiderlets table in the db... I'll bet that all slots are "in use" yet there are URLs to do, which is why the dispatcher is just hanging there. I have no idea why this would be, but if you simply change the status in the database to not in use, the dispatcher will see that and start firing off spiders again.
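The database side of that check and reset might be sketched as follows, again in Python with SQLite instead of the spider's PHP and MySQL. The spiderlets and crawl_pages table names appear in the thread, but the slot, in_use and done columns are assumptions, and make_demo_db only builds a stuck state to try the functions against.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory database in the stuck state described above."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE spiderlets (slot INTEGER PRIMARY KEY, in_use INTEGER NOT NULL)")
    db.execute("CREATE TABLE crawl_pages (url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)")
    db.executemany("INSERT INTO spiderlets (slot, in_use) VALUES (?, 1)", [(1,), (2,), (3,)])
    db.executemany("INSERT INTO crawl_pages (url) VALUES (?)", [("http://example.com/a",), ("http://example.com/b",)])
    db.commit()
    return db

def looks_stuck(db):
    """True when every slot is marked in use while URLs are still waiting."""
    total = db.execute("SELECT COUNT(*) FROM spiderlets").fetchone()[0]
    in_use = db.execute("SELECT COUNT(*) FROM spiderlets WHERE in_use = 1").fetchone()[0]
    pending = db.execute("SELECT COUNT(*) FROM crawl_pages WHERE done = 0").fetchone()[0]
    return total > 0 and in_use == total and pending > 0

def release_all_slots(db):
    """Mark every slot as free and return how many rows were changed.

    Only sensible when ps aux shows no spiderlets actually running.
    """
    cur = db.execute("UPDATE spiderlets SET in_use = 0 WHERE in_use = 1")
    db.commit()
    return cur.rowcount
```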

My new way of doing this is for the dispatcher to do its own ps aux and evaluate actual processes rather than a database "lock", which has given me troubles in other areas. Hope this helps!
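A rough sketch of the process-counting approach, in Python rather than the dispatcher's PHP. The marker "spiderlet.php" is a made-up script name used to recognise spiderlet processes, and the function names are hypothetical; the real dispatcher may match processes differently.

```python
import subprocess

def count_spiderlets(ps_output, marker="spiderlet.php"):
    """Count the lines of ps aux output whose text contains the marker."""
    lines = ps_output.splitlines()[1:]  # skip the column header line
    return sum(1 for line in lines if marker in line)

def running_spiderlets(marker="spiderlet.php"):
    """Run ps aux and count the live processes that match the marker."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return count_spiderlets(result.stdout, marker)

def free_slots(max_spiderlets, running):
    """Slots the dispatcher may still fill, never less than zero."""
    return max(0, max_spiderlets - running)
```

Because the count comes from live processes, a spiderlet that dies on an error frees its slot automatically, with no database row left behind to reset.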

/p