The Cache: Technology Expert's Forum
Author Topic: Scraping G serps - how to avoid IP-ban?  (Read 8825 times)
netmktg
Rookie
Posts: 37
« on: November 29, 2008, 09:28:08 AM »

I was just looking at the blurb on serpscraper.com and was intrigued by the mention of a BH secret that "cannot be entirely reviled" which works around the Google IP-ban. I'm sure software by Earl cannot be "Reviled", but I was hoping this dark secret could be "Revealed" by the gurus @ Perkiset.
vsloathe
vim ftw!
Global Moderator
Lifer
Posts: 1669
« Reply #1 on: November 29, 2008, 11:55:57 AM »

There are only a few factors.

For one, you need a large word list. For another, you need a pool of common useragents. Your queries should be as unique as possible to avoid bans.

Here's a piece of code to get your gears turning:

Code:
<?php
// Requires PHP5 with the DOM extension and the Tidy extension.
class scrapeGoogle {
    private $arrResults = array();
    private $arrWordlist = array();
    private $googleURL = 'http://www.google.com/search?hl=en&q=[QUERY]&start=[START]&num=100&btnG=Google+Search';

    public function __construct() {
        // Load the word list (one word per line) and strip whitespace.
        $this->arrWordlist = file('data/wordlist.txt');
        foreach ($this->arrWordlist as $key => $word) {
            $this->arrWordlist[$key] = trim($word);
        }
    }

    public function scrape($term, $page = 1) {
        // Pad the query with five random words so each request is unique.
        $keyword = $this->arrWordlist[array_rand($this->arrWordlist)]
            ." "
            .$this->arrWordlist[array_rand($this->arrWordlist)]
            ." "
            .$this->arrWordlist[array_rand($this->arrWordlist)]
            ." "
            .$this->arrWordlist[array_rand($this->arrWordlist)]
            ." "
            .$this->arrWordlist[array_rand($this->arrWordlist)];
        $query = urlencode($term.' '.$keyword);
        $scrapeURL = str_replace('[QUERY]', $query, $this->googleURL);
        $scrapeURL = str_replace('[START]', (($page - 1) * 100), $scrapeURL);

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0');
        curl_setopt($ch, CURLOPT_URL, $scrapeURL);
        $response = curl_exec($ch);

        // If G throws a 403, back off for a random interval and retry.
        while (strstr($response, '403 Forbidden')) {
            sleep(rand(1, 60));
            $response = curl_exec($ch);
        }
        curl_close($ch);

        // Tidy up the markup so DOMDocument can parse it reliably.
        $cleanHTML = tidy_parse_string($response);
        tidy_clean_repair($cleanHTML);
        $DOMDoc = new DOMDocument();
        @$DOMDoc->loadHTML(tidy_get_output($cleanHTML));

        // Result links carry class="l" in Google's markup of the day.
        $nodelist = $DOMDoc->getElementsByTagName('a');
        foreach ($nodelist as $node) {
            if ($node->getAttribute('class') == 'l') {
                // Skip hrefs we have already collected.
                if (!in_array($node->getAttribute('href'), $this->arrResults)) {
                    echo $this->arrResults[] = $node->getAttribute('href');
                    echo "\n";
                    flush();
                    ob_flush();
                }
            }
        }
    }
}
?>


It requires PHP5 (the DOM extension) and the Tidy extension to work properly.
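
If you want to drive it, a minimal usage sketch (the filename and search term are placeholders; extending the class to rotate CURLOPT_USERAGENT per request would cover the useragent point above):

Code:
<?php
// Hypothetical usage of the class above, assuming it is saved
// as scrapeGoogle.php next to this script.
require 'scrapeGoogle.php';

$scraper = new scrapeGoogle();
$scraper->scrape('blue widgets');     // page 1 (results 1-100)
sleep(rand(5, 30));                   // random pause between pages
$scraper->scrape('blue widgets', 2);  // page 2 (results 101-200)
?>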
dink
Expert
Posts: 349
« Reply #2 on: November 30, 2008, 10:53:44 AM »

Hey netm . . . you could also try your scraper on AOL search. Or, code the scraper to send its requests to a different G datacenter on each run - rough sketch below.
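
Something like this, maybe - a minimal sketch of the datacenter idea (the IPs are placeholders, not a live datacenter list; real addresses changed all the time, so you'd maintain your own):

Code:
<?php
// Pick a random G datacenter per run and query it directly by IP.
// NOTE: these IPs are placeholders only.
$datacenters = array('64.233.167.104', '72.14.207.104', '216.239.59.104');

function datacenterSearchURL($query, $datacenters) {
    $host = $datacenters[array_rand($datacenters)];
    return 'http://'.$host.'/search?hl=en&q='.urlencode($query).'&num=100';
}

echo datacenterSearchURL('blue widgets', $datacenters)."\n";
?>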
DangerMouse
Expert
Posts: 244
« Reply #3 on: November 30, 2008, 12:19:04 PM »

I guess it depends on the purpose of the scrape - rank checking versus target acquisition.

I've heard that sending multiple requests over the same TCP connection a la the HTTP 1.1 specification works - pipelining. I've not really tried it enough to confirm, though.
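
For what it's worth, a rough (untested) sketch of the pipelining idea with a raw socket - write both requests before reading anything back (www.example.com stands in for the real target):

Code:
<?php
// HTTP/1.1 pipelining: send two requests back-to-back on one socket,
// then read both responses in order.
$fp = fsockopen('www.example.com', 80, $errno, $errstr, 10);
if (!$fp) {
    die("$errstr ($errno)\n");
}

$req  = "GET /page1 HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
$req .= "GET /page2 HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n";
fwrite($fp, $req);

// Both responses come back in order on the same connection.
while (!feof($fp)) {
    echo fgets($fp, 4096);
}
fclose($fp);
?>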

DM
Bompa
Administrator
Lifer
Posts: 564
« Reply #4 on: November 30, 2008, 11:59:18 PM »

[quote DangerMouse]
I guess it depends on the purpose of the scrape - rank checking versus target acquisition.

I've heard that sending multiple requests over the same TCP connection a la the HTTP 1.1 specification works - pipelining. I've not really tried it enough to confirm, though.
[/quote]

I spent a lot of time researching this.

Multiple requests over one TCP connection - that's HTTP Keep-Alive, controlled by Apache's KeepAlive directive.

Anyways, it does work; in fact, browsers already use the feature.

You can see the:

Connection: Keep-Alive
Keep-Alive: 300

headers with LiveHTTPHeaders. The 300 is seconds, not requests, heh.

And Apache has this feature enabled by default.

But there are several problems:

1. Google does not use out-of-the-box Apache, LOL. Their version is heavily modified and renamed: GWS.

2. Apache's default setting is a maximum of five requests per TCP connection, which sorta defeats the whole purpose, but what do I know.

3. Any savvy admin can lower the max requests or simply disable Keep-Alive altogether.
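
A minimal sketch of the idea with cURL, for the curious - reusing one handle lets cURL hold the TCP connection open across requests, as long as the server plays along (the URLs are placeholders):

Code:
<?php
// Fetch several pages over one connection by reusing a single cURL
// handle; cURL keeps the connection alive between requests by default.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    echo $url.' -> '.strlen($body)." bytes\n";
}
curl_close($ch); // the connection is only torn down here
?>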



thank you and come again,
Bomps

« Last Edit: December 01, 2008, 12:01:18 AM by Bompa »
nutballs
Administrator
Lifer
Posts: 5627
« Reply #5 on: December 07, 2008, 09:38:59 AM »

I am so glad I stocked up on API keys a long time ago. LOL

mampy
Journeyman
Posts: 68
« Reply #6 on: December 07, 2008, 12:48:02 PM »

LOL....

Does anyone know whether the MAC address is available with each packet Apache receives, and whether it could be used as a method of banning?
vsloathe
vim ftw!
Global Moderator
Lifer
Posts: 1669
« Reply #7 on: December 07, 2008, 07:03:16 PM »

Apache can't see packets at that low a level - the OS controls everything down there. The MAC address lives in the link-layer frame header anyway, and it gets rewritten at every hop, so the server would only ever see its own router's MAC, not yours. Apache doesn't even handle the TCP handshake, just the HTTP serving.

I think, anyway.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096
« Reply #8 on: December 08, 2008, 07:46:16 AM »

umm... dunno about that, VS... Apache is responsible for opening and handling the socket; the event is only vectored to it by the OS... I think
vsloathe
vim ftw!
Global Moderator
Lifer
Posts: 1669
« Reply #9 on: December 08, 2008, 07:51:05 AM »

Yes, I think you're correct, perks. I was thinking of it too much from a client perspective.