The Cache: Technology Expert's Forum
 
Author Topic: detect and ban scrapers  (Read 2063 times)
nutballs
Administrator
Lifer

Posts: 5627


Back in my day we had 9 planets


« on: November 25, 2008, 08:58:06 PM »

Because I'm a bastard, scrapers don't get to my goodies anyway, but they are pooching my logs with hundreds of entries that are skewing my data.
SO....

this has to be file based, so no DB.

my idea is to store a running list of the past 100 unidentified (i.e. not a known spider) IPs to hit a site.
Then when an IP hits, I check whether it occurs more than X times in the tracking file; if so, ban it, otherwise let it through.

I don't really care about timestamps, I think, because frankly there will never be a legitimate reason a surfer would hit the same site more than once or twice, EVER. I think this can also help me identify new bots; though it might be too late for that site, others will benefit.
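Something like this rolling-window check, sketched in Python just for illustration (the file path, window size, and ban threshold are all placeholders, not anything decided in this thread):

```python
import os

TRACK_FILE = "/tmp/ip_track.txt"  # hypothetical path for the tracking file
WINDOW = 100    # keep only the last 100 unidentified hits
THRESHOLD = 5   # ban if an IP shows up more than this many times

def should_ban(ip):
    """Log this hit into the rolling file and decide whether to ban."""
    ips = []
    if os.path.exists(TRACK_FILE):
        with open(TRACK_FILE) as f:
            ips = f.read().split()
    ips.append(ip)
    ips = ips[-WINDOW:]  # rolling window: drop anything older than 100 hits
    with open(TRACK_FILE, "w") as f:
        f.write("\n".join(ips))
    return ips.count(ip) > THRESHOLD
```

Since the window only holds the last 100 hits, a legit IP ages out of the file on its own; no cleanup cron needed.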

thoughts?

I could eat a bowl of Alphabet Soup and shit a better argument than that.
ekibastos
Rookie

Posts: 38


« Reply #1 on: November 26, 2008, 12:20:05 AM »

<quote nutballs>
i don't really care about times I think, because frankly there will never be a legitimate reason a surfer would hit the same site more than 1 or 2 times, EVER. ...

thoughts?
</quote>

What about 20 people at an office convention all going to the site from the same wifi connection at the hotel? 20 different people might really want to see the same site and would all share the same IP; if they were in a meeting it might all happen within 10 minutes, or an hour, or a day. Or if someone reboots a few times and Firefox restores all their tabs, but it takes like 3 reboots to get their computer updated and straight, they could be stuck in a reboot loop with multiple home-page tabs or something...

I guess you could do some sort of Java or Flash thing client side, executed by a real browser, that would send back a random number based on a recent actual mouse-movement event. That would link a unique ID to a certain log event, which could perhaps be depooched, and prove a simple HTML-based spider wasn't running.

Just a thought from the midnite brain..

I understand that you can emulate storing and retrieving cookie data, but are there Java-executing spiders out there? I don't know much about it at all really, but this post has me curious now...




<quote nutballs>
Apple is that hot chick that gives you lots of sex but is a total bitch with a horrible abusive temper.
</quote>
vsloathe
vim ftw!
Global Moderator
Lifer

Posts: 1669



« Reply #2 on: November 26, 2008, 06:40:39 AM »

Check the UAs. I know it's not a final solution, but I was getting scraped hard on one forum and found that the scraper was always using some weirdass user agent, something to do with Epson and Bluetooth. Anyway, I banned that UA and the problem went away. Obviously that's easy to circumvent by switching to a standard UA, but it worked in this case.
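For what it's worth, a UA banlist is basically a one-liner; the substrings below are just my guesses at vsloathe's Epson/Bluetooth agent, not the actual strings he banned:

```python
# Substrings to ban on; "epson" and "bluetooth" are illustrative guesses
BANNED_UA_SUBSTRINGS = ["epson", "bluetooth"]

def ua_banned(user_agent):
    """True if the user agent contains any banned substring (case-insensitive)."""
    ua = user_agent.lower()
    return any(s in ua for s in BANNED_UA_SUBSTRINGS)
```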

hai
nutballs
Administrator
Lifer

Posts: 5627


Back in my day we had 9 planets


« Reply #3 on: November 26, 2008, 07:41:41 AM »

EK: on a 'normal' site I would never do this; however, the likelihood of 2 office people hitting my sites within a reasonable period, and actually looking to buy something, is SO LOW it's silly. AOL, however, is a different beast... so V's addition makes sense.
I should also clarify that a user NEVER sees my site, so I can't do any client-side validation, for which I already have a smoking little trick posted elsewhere on here, in a thread with BEACON in the title.

V: Adding in the UAs will probably handle most of the AOL-type situations. So I'll instead ban by IP/UA pair.
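The rolling-file idea works unchanged if the tracking key becomes the IP/UA pair instead of the bare IP — a sketch (hashing the pair into one token is my own choice for illustration, not anything specified in the thread):

```python
import hashlib

def visitor_key(ip, user_agent):
    """Collapse an IP/UA pair into one fixed-length tracking key, so
    several office surfers behind the same NAT (different browsers,
    same IP) don't all get counted as a single scraper."""
    return hashlib.md5(f"{ip}|{user_agent}".encode()).hexdigest()
```

Feed `visitor_key(ip, ua)` to the tracking file in place of the raw IP and the threshold logic stays exactly the same.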

Cool, thanks for the gear-turners, guys.
« Last Edit: November 26, 2008, 07:44:55 AM by nutballs »

I could eat a bowl of Alphabet Soup and shit a better argument than that.