The Cache: Technology Expert's Forum
 
Author Topic: Perk's Spider Source  (Read 2642 times)
thedarkness
« Reply #15 on: April 27, 2007, 04:45:18 PM »

I always carry an extra one in my............. well........ never mind....

"I want to be the guy my dog thinks I am."
 - Unknown

Caligula
« Reply #16 on: April 27, 2007, 06:09:19 PM »

Oh, that's just wrong....

Dbyt3r
« Reply #17 on: April 28, 2007, 07:47:11 AM »

Now all you need is one gigantic AI or regex script to identify every page as a trackback, blog, guestbook, etc. and spam them accordingly. Applause

krisis
« Reply #18 on: April 17, 2008, 04:53:02 PM »

This is really nice work Perk!

I've been looking for ways that I could do something similar, though it would be to pull down and process single pages across a large number of URLs queued up in a database. This sort of system should work well. I was originally looking at writing a daemon and forking processes for each spider, etc. but I like the simplicity of what you've done here.

If I understand your code correctly, it executes a background process for each spiderlet, up to 10 max at any time. The crawler script finishes when there are no more pages in the database to spider.

Just a few questions:

1) What is the maximum number of spiderlets you've tried successfully?
2) I assume this could be changed to run with a WHILE(TRUE) loop and a few other modifications, so that it runs like a service that keeps checking the database for new URLs to spider instead of stopping when none are left in the queue?
3) Can you please explain the CrawlState field? 1 seems to be for pages waiting for crawl, 0 for pages that have been crawled, and -1 for currently being crawled or failed?
4) What are the LastPing and NextPing fields for? They don't seem to be used in your code.

Thanks again.
- Krisis


perkiset
« Reply #19 on: April 17, 2008, 05:47:08 PM »

Quote
This is really nice work Perk!
Thanks krisis, and welcome to the Cache.

Quote
If I understand your code correctly, it executes a background process for each spiderlet, up to 10 max at any time. The crawler script finishes when there are no more pages in the database to spider.
I haven't looked at this version of my spider for a while, because it's actually a scaled-down version for posting here. Fair to say that I've kept a little bit of juice for myself Wink

Quote
1) What is the maximum number of spiderlets you've tried successfully?
In my production version, I have a table called "sys_vars" that the dispatcher checks to decide how many spiderlets can run at once. Rather than semaphores or any other complexity, I just do a shell_exec('ps aux') into a var, then regex out the spider processes to get a current count - much more effective than any other form of communication. I can also monitor how long a spiderlet has been working on a single URL to see if it needs to be killed. I don't run this on dedicated machines, so I've never really pushed it beyond about 100 spiderlets. Remember that the real bottleneck is waiting for the response, not processing... so this type of system is pretty light on the processor.
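
For readers following along, the process-counting idea above can be sketched roughly as follows. This is a Python illustration, not perk's PHP: the script name spiderlet.php, the regex, and the function names are all assumptions made for the example. The limit would come from wherever you store it (the post mentions a "sys_vars" table).

```python
import re
import subprocess

# Hypothetical script name; the thread does not say what the spiderlet file is called.
SPIDERLET_PATTERN = re.compile(r"\bphp\b.*\bspiderlet\.php\b")


def count_spiderlets(ps_output: str, pattern: re.Pattern = SPIDERLET_PATTERN) -> int:
    """Count the lines of `ps aux` output that look like running spiderlets."""
    return sum(1 for line in ps_output.splitlines() if pattern.search(line))


def running_spiderlets() -> int:
    """Take a fresh process listing and count the spiderlets in it."""
    listing = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return count_spiderlets(listing)


def can_dispatch(max_spiderlets: int) -> bool:
    """True while this machine is below its configured spiderlet limit."""
    return running_spiderlets() < max_spiderlets
```

Because the count is taken from the process table each time, there is no shared counter to get out of sync if a spiderlet dies unexpectedly.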


Quote
2) I assume this could be changed to run with a WHILE(TRUE) loop and a few other modifications, so that it runs like a service that keeps checking the database for new URLs to spider instead of stopping when none are left in the queue?
Absolutely. My email blaster works on a similar system, although it only sends [a certain number of emails] per minute, and a cron job fires it up again once a minute. The combination of number-of-blastlets and the fixed-interval cron mechanism makes throttling really easy, which is particularly important with eblasting.
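
A minimal Python sketch of the two variants discussed here, under stated assumptions: run_batch is the "do a fixed amount, then exit and let cron fire it again" style, and serve is the WHILE(TRUE)-style service loop. The function names, the in-memory queue, and the cycles parameter (there only so the loop can be stopped) are hypothetical; a real version would pull work from the database.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Optional


def run_batch(queue: Deque[str], max_per_run: int,
              handle: Callable[[str], None]) -> List[str]:
    """Process at most `max_per_run` queued items, then return.

    Fired once a minute by cron, this caps throughput at `max_per_run` per minute.
    """
    done: List[str] = []
    while queue and len(done) < max_per_run:
        item = queue.popleft()
        handle(item)
        done.append(item)
    return done


def serve(queue: Deque[str], max_per_run: int, handle: Callable[[str], None],
          interval_seconds: float = 60.0, cycles: Optional[int] = None,
          sleep: Callable[[float], None] = time.sleep) -> None:
    """Service-style variant: keep polling instead of exiting on an empty queue."""
    completed = 0
    while cycles is None or completed < cycles:
        run_batch(queue, max_per_run, handle)
        sleep(interval_seconds)
        completed += 1
```

Either way the throttle has two knobs: how much one run may do, and how often a run starts.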

Quote
3) Can you please explain the CrawlState field? 1 seems to be for pages waiting for crawl, 0 for pages that have been crawled, and -1 for currently being crawled or failed?
Pretty much right on, with a couple more details. (I don't remember all the state codes, but I'll expand on what I use it for.)

When I respider a site, I set everything I know about to a state of 0 so that I know what needs to be crawled. If a page is no longer available, I know it... if a page is newly orphaned, I know it, and so forth. The "newly orphaned" feature is particularly handy when working with my own sites and I want to know if I am dropping anything important for the search engines. In my production version I also use states to identify 301s and 302s, temporary unavailability, etc.
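
One way to picture the respider bookkeeping described above, as a Python sketch using sqlite3 as a stand-in for MySQL. The crawl_pages table and CrawlState field are named in this thread; the url column, the numeric codes, and the function names are placeholders, since the real state codes are not listed here.

```python
import sqlite3
from typing import List

# Placeholder codes: the thread does not list the real CrawlState values.
PENDING, CRAWLED, GONE = 0, 1, 2


def reset_for_respider(conn: sqlite3.Connection) -> None:
    """Mark every known page as pending before a fresh crawl starts."""
    conn.execute("UPDATE crawl_pages SET CrawlState = ?", (PENDING,))


def record_result(conn: sqlite3.Connection, url: str, found: bool) -> None:
    """Store the outcome for a page the crawl actually reached."""
    state = CRAWLED if found else GONE
    conn.execute(
        "UPDATE crawl_pages SET CrawlState = ? WHERE url = ?", (state, url)
    )


def newly_orphaned(conn: sqlite3.Connection) -> List[str]:
    """Pages still pending after the crawl were never reached through a link."""
    rows = conn.execute(
        "SELECT url FROM crawl_pages WHERE CrawlState = ? ORDER BY url",
        (PENDING,),
    )
    return [url for (url,) in rows]
```

The point of the reset is that anything the new crawl does not touch stands out afterwards: it keeps the reset value while every reached page has moved on to some other state.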

Quote
4) What are the LastPing and NextPing fields for? They don't seem to be used in your code.
I watch sites that are linked to me as well... I want to see the last time I checked [that] page and have a schedule for when I will next check it. So imagine a cron job that is looking for any page that has "come due" and needs to be spidered, but I don't do anything else except that.
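
A rough Python sketch of the "come due" check, again with sqlite3 standing in for MySQL. LastPing and NextPing are the field names from this thread; the url column, the ISO-string timestamp storage, and the function names are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import List


def _stamp(moment: datetime) -> str:
    """Fixed-width ISO timestamps sort correctly as plain strings."""
    return moment.isoformat(timespec="seconds")


def due_pages(conn: sqlite3.Connection, now: datetime) -> List[str]:
    """Return the pages whose NextPing time has arrived, oldest first."""
    rows = conn.execute(
        "SELECT url FROM crawl_pages WHERE NextPing <= ? ORDER BY NextPing",
        (_stamp(now),),
    )
    return [url for (url,) in rows]


def mark_pinged(conn: sqlite3.Connection, url: str, now: datetime,
                interval: timedelta) -> None:
    """Record this check and schedule the next one `interval` from now."""
    conn.execute(
        "UPDATE crawl_pages SET LastPing = ?, NextPing = ? WHERE url = ?",
        (_stamp(now), _stamp(now + interval), url),
    )
```

A cron-fired script would call due_pages, spider each result, then call mark_pinged so the page drops out of the due list until its next scheduled check.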

Hope that helps... you might also want to look at another thread I wrote up, "Pseudo multithreading with PHP and Apache", which talks about using the Apache server to throw multiple threads via a web request... I like that technique quite a bit as well.

Again, welcome to the Cache.
/perk

If I can't be Mr. Root then I don't want to play.

krisis
« Reply #20 on: April 17, 2008, 06:31:20 PM »

Quote
Thanks krisis, and welcome to the Cache.
Thanks. You have some great code samples and discussions here that are proving very useful to me. I've been looking all over the place for stuff on multi-threading (or pseudo multi-threading) with PHP for this project, and this is definitely the most elegant solution I've come across.

Quote
Hope that helps... you might also want to look at another thread I wrote up, "Pseudo multithreading with PHP and Apache", which talks about using the Apache server to throw multiple threads via a web request... I like that technique quite a bit as well.
Looks interesting. It seems that it would be more memory efficient than this exec > /dev/null & method. Do you see any other pros/cons of writing a spider to use your pseudo multithreading method instead?

I also plan on using a MySQL database for communication between the "threads" and to keep track of how many are running, their status, etc. This would also make it scalable in case my requirements grow, so that I can have multiple servers running instances of the spider, all communicating through the database.

Quote
this is actually a scaled down version for posting here. Fair to say that I've kept a little bit of juice for myself
Of course Smiley However, if you've got any further optimizations or things to look out for which would be useful to me, I'd appreciate any extra advice to make my life easier and avoid any major issues.

Thanks again.
- Krisis

perkiset
« Reply #21 on: April 17, 2008, 06:59:01 PM »

Quote
Thanks. You have some great code samples and discussions here that are proving very useful to me. I've been looking all over the place for stuff on multi-threading (or pseudo multi-threading) with PHP for this project, and this is definitely the most elegant solution I've come across.
Most kind.

Quote
Looks interesting. It seems that it would be more memory efficient than this exec > /dev/null & method. Do you see any other pros/cons of writing a spider to use your pseudo multithreading method instead?
The main benefit of using this exec'd mechanism is simplicity. It is heavier on RAM than the Apache method but is hugely stable and relies on tried-and-true OS mechanisms to get the job done, i.e., whether or not Apache or anything else is running, this will still go.

The main benefit of using the Apache method is that you're only pulling the trigger on one instance of PHP (the one that Apache is hanging on to). This matters most to me when I need access to the APC cache, which you cannot get to when running in a shell. The downside is that on a busy server you're using up handles for Apache clients, i.e., if you have a max of 500 connections available on your instance of Apache and you run 100 spiders, then you've just knocked your web capability down by 20%.


Quote
I also plan on using a MySQL database for communication between the "threads" and to keep track of how many are running, their status, etc. This would also make it scalable in case my requirements grow, so that I can have multiple servers running instances of the spider, all communicating through the database.
Well, you don't REQUIRE a database for thread communication, because the number of threads running is really machine dependent, no? In other words, the dispatcher gets the "next" URL from the database, then dispatches a spiderlet on {this} machine... why couldn't you have bunches of machines all working off the same database for URLs, but handling the number of threads locally via the shell_exec function? You could have busy web servers doing no more than 10 threads at a time, and light ones doing 100 at a time - all managed locally rather than centrally, which would be cool.
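
The shared-queue, local-limit arrangement could look something like this Python sketch, with sqlite3 standing in for the shared MySQL database. The claimed_by column, the state codes, and the function names are hypothetical, and the running and launch callables stand in for the local process count and the exec call respectively.

```python
import sqlite3
from typing import Callable, List, Optional

# Placeholder codes; the thread does not give the real CrawlState values.
PENDING, IN_PROGRESS = 0, -1


def claim_next_url(conn: sqlite3.Connection, machine: str) -> Optional[str]:
    """Claim one pending URL for this machine, or return None when none are left."""
    while True:
        row = conn.execute(
            "SELECT url FROM crawl_pages WHERE CrawlState = ? ORDER BY url LIMIT 1",
            (PENDING,),
        ).fetchone()
        if row is None:
            return None
        # The state check in the WHERE clause lets only one claimant win a row.
        claimed = conn.execute(
            "UPDATE crawl_pages SET CrawlState = ?, claimed_by = ? "
            "WHERE url = ? AND CrawlState = ?",
            (IN_PROGRESS, machine, row[0], PENDING),
        )
        conn.commit()
        if claimed.rowcount == 1:
            return row[0]


def dispatch(conn: sqlite3.Connection, machine: str, local_max: int,
             running: Callable[[], int],
             launch: Callable[[str], None]) -> List[str]:
    """Launch spiderlets until this machine hits its own limit or the queue is empty."""
    started: List[str] = []
    while running() < local_max:
        url = claim_next_url(conn, machine)
        if url is None:
            break
        launch(url)
        started.append(url)
    return started
```

Each machine passes its own local_max, so a busy web server and an idle box can share one URL table without any central coordinator deciding thread counts.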

Quote
Of course Smiley However, if you've got any further optimizations or things to look out for which would be useful to me, I'd appreciate any extra advice to make my life easier and avoid any major issues.
Nothing comes immediately to mind... but ping here if you get stuck or have an idea.

Good luck!
/p

krisis
« Reply #22 on: April 17, 2008, 07:34:16 PM »

Quote
The main benefit of using the Apache method is that you're only pulling the trigger on one instance of PHP (the one that Apache is hanging on to). This matters most to me when I need access to the APC cache, which you cannot get to when running in a shell. The downside is that on a busy server you're using up handles for Apache clients, i.e., if you have a max of 500 connections available on your instance of Apache and you run 100 spiders, then you've just knocked your web capability down by 20%.
I intend to run this on dedicated machines with nothing else running on them, i.e., machines just for spidering. Given this, would you recommend the exec mechanism over the Apache method? It seems that it may be more stable, but may not allow for as many threads (due to increased memory requirements).

Quote
why couldn't you have bunches of machines all working off the same database for URLs, but handling the number of threads locally via the shell_exec function?
Yes, this is exactly what I plan on doing. The threads don't really need to communicate with each other, but I would still like to monitor the status of each machine and the spider threads running on them, which is where the database communication comes in handy. I even hope to implement this on Amazon EC2 and have it automatically scale the number of machines in use based on the number of URLs in queue.
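
The "scale machines to the queue" idea boils down to a small sizing rule. Here is a hypothetical Python sketch: the per-machine capacity figure and the min/max bounds are made-up parameters, and nothing here touches the EC2 API.

```python
def machines_needed(queued_urls: int, urls_per_machine: int,
                    min_machines: int = 1, max_machines: int = 20) -> int:
    """Size the fleet to the queue, clamped to a floor and a ceiling."""
    if urls_per_machine <= 0:
        raise ValueError("urls_per_machine must be positive")
    wanted = -(-queued_urls // urls_per_machine)  # ceiling division
    return max(min_machines, min(max_machines, wanted))
```

A monitoring job could run this against the count of queued URLs and start or stop instances to match the result.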

Quote
Nothing comes immediately to mind... but ping here if you get stuck or have an idea.
Will do. Thanks again for all your help!

- krs

perkiset
« Reply #23 on: April 17, 2008, 09:00:17 PM »

Quote
I intend to run this on dedicated machines with nothing else running on them, i.e., machines just for spidering. Given this, would you recommend the exec mechanism over the Apache method? It seems that it may be more stable, but may not allow for as many threads (due to increased memory requirements).
Wow, tough one - I haven't really thought about it that way, because the mechanical benefits of each have always decided how I code something, more than the resource requirements. Given the pretty small footprint of both, I think it'd be six of one, half a dozen of the other. I still use the shell method for this sort of job - I use the Apache method when I want to spawn multithreaded things via a web call.

stma
« Reply #24 on: June 16, 2008, 01:10:11 PM »

Just curious if anyone else has run into the problem of the script not doing anything. 

If I run through a browser I get a blank page. No errors, etc..

If I run through command line it does nothing. No errors, etc..

A few things to note: if I uncomment the lines where it prints the session ID, I'm not creating a variable there. But even if I hard-code one, things don't work; nothing is entered into the database.

I've populated the database with a few sites as well - so it has data to grab.

perkiset
« Reply #25 on: June 16, 2008, 01:13:12 PM »

Sounds fishy... set error_reporting to E_ALL and see if there's any difference.

stma
« Reply #26 on: June 16, 2008, 01:36:12 PM »

You're right.... there was something fishy....

[16-Jun-2008 15:18:10] PHP Notice:  Undefined variable: testResp in /home/siteuser/public_html/stuff/class.webrequest.php on line 169
[16-Jun-2008 15:18:10] PHP Notice:  Undefined index:  sessionid in /home/siteuser/public_html/stuff/class.webrequest.php on line 271

perkiset
« Reply #27 on: June 16, 2008, 01:53:06 PM »

Is all the MySQL stuff in place, and does the spider have access to those tables?

stma
« Reply #28 on: June 16, 2008, 02:12:06 PM »

I've been having connection problems all day - when I seeded the crawl_pages table it didn't stick.  Was just coming back here to post it was my mistake and not yours about it not running Smiley

I'm still getting those errors - but it appears to be running (printing out the w).

I'll have to go check my server settings - might have sockets off or something.

Thanks...
« Last Edit: June 16, 2008, 02:30:35 PM by stma »

perkiset
« Reply #29 on: June 16, 2008, 05:09:16 PM »

The 'w' means that a thread is waiting for a response, I believe... if it's just kicking that out every few ticks then there's definitely a problem with the spiderlet getting out and generating a request. Try running the spiderlet all by itself, rather than through the crawler, so that it can spit out its error messages directly to you. That's the best way to debug the outbound connection.
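
If it helps to rule out the network first, a standalone probe along these lines makes a single request and prints whatever goes wrong. This is a Python sketch using only the standard library, not the spiderlet or class.webrequest.php; the function name and the timeout value are assumptions.

```python
import urllib.error
import urllib.request
from typing import Tuple


def probe(url: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Fetch `url` once and report the byte count or the error that stopped it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # Surface the exception type so DNS, refusal and timeout problems are distinguishable.
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"fetched {len(body)} bytes"
```

Running probe against one of the seeded URLs from the same machine and user account shows quickly whether outbound requests work at all, separately from anything the crawler does.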