The Cache: Technology Expert's Forum
 
Author Topic: Quick & Dirty URL Spider  (Read 4625 times)
Caligula
Rookie
« on: May 12, 2007, 10:05:28 PM »

Well, this is what I have been working on... I want to thank everyone here and at the Syndk8 (same people, I know) for all their help. This probably won't be much to you pros, but hey, if any noobs happen by, maybe it will help them learn.

Simple spider: it just collects clean URLs and spiders them for new ones.

It requires a database. I created mine manually through phpMyAdmin because it's easier and I don't trust the script to do it correctly.

ID->INT->auto_increment->PRIMARY Key
urls->VARCHAR(100)->UNIQUE Key

Those are the only fields it uses - the unique key keeps duplicate URLs from filling up the DB.
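
A minimal sketch of how the unique key does the de-duplication. Assumptions, none of which come from the original post: it uses PDO with an in-memory SQLite database as a stand-in so that it runs without a MySQL server (the pdo_sqlite extension is needed), the helper name `save_url` is made up, and `INSERT OR IGNORE` is SQLite's spelling of what MySQL calls `INSERT IGNORE`.

```php
<?php
// Stand-in for the MySQL table described above: an auto-increment ID
// plus a urls column with a UNIQUE key. In-memory SQLite is an assumption
// made so the sketch runs without a database server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE urls (
    ID   INTEGER PRIMARY KEY AUTOINCREMENT,
    urls VARCHAR(100) UNIQUE
)');

// Hypothetical helper: returns true if the URL was new, false if the
// unique key rejected it as a duplicate.
function save_url(PDO $pdo, string $url): bool
{
    // SQLite syntax; MySQL spells this INSERT IGNORE.
    $stmt = $pdo->prepare('INSERT OR IGNORE INTO urls (urls) VALUES (?)');
    $stmt->execute([$url]);
    return $stmt->rowCount() === 1;
}

$first  = save_url($pdo, 'http://www.example.com');
$second = save_url($pdo, 'http://www.example.com'); // duplicate, ignored
$third  = save_url($pdo, 'http://www.example.org');
$total  = (int) $pdo->query('SELECT COUNT(*) FROM urls')->fetchColumn();
```

The second insert is silently dropped by the unique key, so the table only ever holds one row per URL no matter how often the spider finds it.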

* At around 35 seconds the spider tends to stop. I haven't figured out whether this is because the window times out or because the spider dead-ends, BUT it will continue to return data! In the test runs it kept spidering and returning URLs for 15 minutes after the script had essentially stopped, collecting approx. 2,334 URLs.

* The JavaScript at the beginning is a live timer. It's the result of about 2 hours' worth of JS programming experience and is there to let me time the run and measure results. The spider will run without it.


Code:

<HTML>
<HEAD>
<style>span{cursor:pointer;color:white;background:black;}</style>
<script type="text/javascript">
// Live timer shown as min:sec:hundredths
var msec=0
var sec=0
var min=0
var go
function start(){
document.forms[0].display.value=min+":"+sec+":"+msec
// tick every 10 ms so that 100 ticks make roughly one second
go=setTimeout(start,10)
msec++
if(msec==100){
msec=0
sec++
}
if(sec==60){
sec=0
min++
}}
// Stop the timer
function stopspider(){
clearTimeout(go);
}
// Change the background of the stop button on hover
function over(color)
{document.getElementById('over').style.background=color}
function out(color)
{document.getElementById('over').style.background=color}
</script>
</head>
<body>
<div align="center" style="width:10em;position:absolute;left:300px;top:4em;">
<form><input type="text" name="display" size="22" value="00:00:00"></form>
<script language="JavaScript">start();</script>
<span id="over" onmouseover="over('red')" onmouseout="out('black')" onclick="stopspider(window.stop())"><b>Stop Spider!</b></span>
<br><br>
<?php
// Spider Build v1.04 Beta CBBWSB.O

// Keep Script From Timing Out
// Prints a counter every 25 seconds; note that all 100 iterations run
// before the spider below starts.
set_time_limit(0);
$i = 1;
while ($i <= 100) {
    echo "$i";
    flush();
    sleep(25);
    $i++;
}

// Main Connect To DB
$db = mysql_connect("localhost", "USER", "PASS")
    or die('I cannot connect to the database because: ' . mysql_error());
mysql_select_db("DB NAME", $db);
$table = "urls";

$key  = "keyword"; // Keyword You're Searching For
$stn  = 0;         // Number of Results To Start At
$grab = 10;        // Number of Results To Grab

// Start The Spider
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/search?safe=off&q=" . urlencode($key) . "&start=$stn&num=$grab&sa=N");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);

if ($result) {

  // Collect every http://www.<name>.<3-letter TLD> URL in the results page
  preg_match_all('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $result, $output, PREG_SET_ORDER);
  foreach ($output as $item) {

    // Write Initial Data (NULL lets auto_increment pick the ID)
    mysql_select_db("DB NAME", $db);
    $sqlquery = "INSERT INTO $table VALUES(NULL,'" . mysql_real_escape_string($item[1], $db) . "')";
    $results = mysql_query($sqlquery);

    // Spider
    // Fetch only the urls column so that every value in $row is a URL
    mysql_select_db("DB NAME", $db);
    $spider = mysql_query("SELECT urls FROM $table");

    while ($row = mysql_fetch_row($spider)) {

      foreach ($row as $spurl) {

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, "$spurl");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result2 = curl_exec($ch);
        curl_close($ch);

        if ($result2) {

          preg_match_all('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $result2, $output2, PREG_SET_ORDER);
          foreach ($output2 as $item2) {

            // Write Data Into The Database
            mysql_select_db("DB NAME", $db);
            $sqlquery2 = "INSERT INTO $table VALUES(NULL,'" . mysql_real_escape_string($item2[1], $db) . "')";
            $results = mysql_query($sqlquery2);
            $see = mysql_query("SELECT urls FROM $table");

            while ($row2 = mysql_fetch_array($see)) {

/*
    echo ("<div style=\"text-align:left;\">");
    echo $row2["urls"];
    echo ("<br />");
    echo ("</div>");
*/

            }
          }
        }
      }
    }
  }
}

?>


</div>
</body>
</html>
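
The URL-matching step in the code above can be tried on its own. This is only a sketch: the function name `extract_urls` and the sample HTML are made up for illustration, and the `array_unique` call stands in for what the table's unique key does in the real script.

```php
<?php
// Hypothetical helper: pull "clean" URLs out of a page with the same
// pattern the spider uses (http://www.<name>.<three-letter TLD>).
function extract_urls(string $html): array
{
    preg_match_all('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $html, $matches, PREG_SET_ORDER);
    $urls = [];
    foreach ($matches as $m) {
        $urls[] = $m[1]; // $m[1] is the first capture group
    }
    // The table's unique key would drop repeats; do the same here.
    return array_values(array_unique($urls));
}

// Made-up sample page for illustration.
$html = '<a href="http://www.example.com/page">a</a> '
      . '<a href="http://www.example.com/other">b</a> '
      . '<a href="http://www.example.org">c</a> '
      . '<a href="https://www.secure.com">d</a> '
      . '<a href="http://www.example.co.uk">e</a>';

$urls = extract_urls($html);
print_r($urls);
```

With this sample only `http://www.example.com` and `http://www.example.org` come back: the pattern drops paths, and it skips https links and TLDs that are not three letters (such as .co.uk).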






Edit: Code Updated

« Last Edit: May 13, 2007, 09:01:29 PM by Caligula »
itchy
Rookie
« Reply #1 on: May 13, 2007, 08:44:49 PM »

Cool, Caligula. I'll give this a whirl tomorrow, as I'm in the spider-writing-and-testing phase of my current learning curve as well.
I'll let you know how I get on.
Caligula
Rookie
« Reply #2 on: May 13, 2007, 09:00:17 PM »


Cool itchy... one thing: the script above times out the browser. In fact, I just got done taking care of that lil problem. Special thanks to TD.

*edit - fixed code timeout problem... updated above.


It will print a number 1...2...3...4... every 25 sec or whatever. Apparently the browser needs to have data written to it to keep it from timing out (since I keep the echos commented out for faster spidering speed).
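
A rough sketch of that keep-alive idea, written so the counter is printed while the work is going on. Everything here is an assumption for illustration: the helper name `heartbeat`, the 25-second interval taken from the description above, and the placeholder work list. Whether a given browser or server really needs this depends on buffering and timeout settings that aren't stated in this thread.

```php
<?php
// Hypothetical helper: print a counter at most once every $interval
// seconds and flush it to the browser. Returns the time of the last beat.
function heartbeat(int $lastBeat, int $interval, int &$count, ?int $now = null): int
{
    $now = $now ?? time();
    if ($now - $lastBeat < $interval) {
        return $lastBeat; // too soon, print nothing
    }
    $count++;
    echo $count, ' ';
    // Push the output through PHP's buffer (if any) and on to the web server.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
    return $now;
}

set_time_limit(0);
$count = 0;
$last  = time();
$work  = ['page 1', 'page 2', 'page 3']; // stand-in for URLs to fetch
foreach ($work as $item) {
    // ... fetch and store $item here ...
    $last = heartbeat($last, 25, $count);
}
```

Calling the helper inside the work loop means the browser gets a trickle of output only while pages are actually being fetched, without echoing every URL.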

Let me know how your test run goes.....



« Last Edit: May 13, 2007, 09:02:29 PM by Caligula »
thedarkness
Lifer
« Reply #3 on: May 14, 2007, 03:47:51 AM »

Guys, if you're not doing any output and you don't take any input......

If you have shell access, just run it from the command line with:
php index.php

It will hose your JS, but from memory that was just a timer? If so:
time php index.php

Oh, this assumes you are on a Linux-type system.
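
If the same file had to work both ways, one option (a sketch only, not something from the script above) is to check how PHP was started. `PHP_SAPI` is a built-in constant; the helper name `running_from_cli` is made up here.

```php
<?php
// Hypothetical helper: true when the script was started with `php file.php`
// rather than through a web server.
function running_from_cli(string $sapi = PHP_SAPI): bool
{
    return $sapi === 'cli';
}

if (running_from_cli()) {
    // No browser to keep alive, so plain text progress is enough.
    echo "Running from the command line\n";
} else {
    // Browser request: this is where the HTML timer markup would go.
    echo "<p>Running in the browser</p>\n";
}
```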

Cheers,
td

"I want to be the guy my dog thinks I am."
 - Unknown
Bompa
Administrator
Lifer
« Reply #4 on: May 14, 2007, 04:44:38 AM »

No input, no output?  That's my kinda script.   LOL



"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein
Caligula
Rookie
« Reply #5 on: May 14, 2007, 02:24:35 PM »

Heyyyyy, there is output.... ROFLMAO it records all the URLs to a database... but for some reason the browser times out if it doesn't get anything, so this way it just prints a number every few seconds so that the browser doesn't time out or slow the script down (echoing the URLs hogs resources). I know this is shit to you pros... but to those of us just getting started with PHP, it's a learning tool... That script is like my "notes from PHP class"....

I don't run any of this stuff from a command line..... I believe that for what you are talking about I would need to have PHP installed on my computer.... I do not.

I have a super secret website which has all my little projects hidden on it.... ROFLMAO
thedarkness
Lifer
« Reply #6 on: May 14, 2007, 02:39:07 PM »






You jerkin' my chain Bomps?

What I meant was it doesn't take any input from the browser and it doesn't have to send any output to the browser so it's a prime candidate to run from the CLI.

Thanks for being a picky bastard though  ROFLMAO

Cheers,
td