The Cache: Technology Expert's Forum
 
Author Topic: Cloaking thousands of IPs!!  (Read 11422 times)
dbrown
Rookie, Posts: 28
« on: July 02, 2008, 09:20:27 PM »

Ok.. First off.. I know most of you guys from syndk8. I don't post there, I lurk. So tickle my balls and love it. Jackoff

Now that we got that out of the way.

I have the whole master/slave, auto site gen, ping, stuff, cloak network going via PHP (and some shell scripts).
I was using iplist.com for my IP cloaking (yea, I know). Then I found spiderSpy and was like yea baby, 20k IPs to cloak. Then it hit me: how the fish should I go about cloaking this massive list, and how do I keep the load down on the servers?

Right now I am cloaking via a PHP script that I wrote. It reads from the txt file that comes from iplist. There is also a bunch of whiz-bang logic there, so I would rather not move away from PHP unless you guys convince me otherwise.

What should I do?

flat files?
read from db?
apache tricks?
memcache?

Any ideas are much appreciated
« Last Edit: July 02, 2008, 09:29:01 PM by dbrown »

Con
Rookie, Posts: 20
"Who's Next?"
« Reply #1 on: July 02, 2008, 09:22:54 PM »

I just run a check on a mysql db. Works fine.
But I'm not much of a coder and keep to the KISS rule whenever possible!  Grin

Good luck,
Con
I can't code. I'm just here to annoy all of you!

dbrown
Rookie, Posts: 28
« Reply #2 on: July 02, 2008, 09:26:02 PM »

Hmm.. I have tons of sites on my servers. If I introduce code that makes them all query a db with 20k rows, I'm sure the load will skyrocket...

nop_90
Global Moderator, Posts: 2203
« Reply #3 on: July 02, 2008, 09:28:32 PM »

Yep, a MySQL db is the way to go with PHP.

I was going to say: load the IPs from the file, convert each to an integer, and put them into a hash. Then when a visitor comes, convert his IP to an integer and look it up in the hash. That would be the fastest method.

But since PHP is not always loaded (the hash would have to be rebuilt on every request) .......
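The integer-hash idea might look like this in PHP — a minimal sketch, assuming a one-IP-per-line text file like the iplist/spiderSpy exports:

```php
<?php
// Build a hash (associative array) keyed by the integer form of each
// spider IP. Assumes one IP per line in the source file.
function load_spider_hash($filename) {
    $hash = array();
    foreach (file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $n = ip2long(trim($line));
        if ($n !== false) {
            $hash[$n] = true;
        }
    }
    return $hash;
}

// Lookup: convert the visitor's IP to an integer and probe the hash.
function is_spider($hash, $visitor_ip) {
    $n = ip2long($visitor_ip);
    return $n !== false && isset($hash[$n]);
}
```

The isset() probe on an integer key is a constant-time hash lookup, so the check itself is as fast as PHP gets; the catch nop_90 points out is that the hash has to be rebuilt (or unserialized) on every request, since nothing stays resident between page loads.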
dbrown
Rookie, Posts: 28
« Reply #4 on: July 02, 2008, 09:30:02 PM »

I edited my first post a bit to explain how I am already cloaking...
dbrown
Rookie, Posts: 28
« Reply #5 on: July 02, 2008, 09:31:31 PM »

and yes, I've seen perk's spiderSpy script in the code repo... Is that my answer, or can we go faster/more efficient?
dbrown
Rookie, Posts: 28
« Reply #6 on: July 02, 2008, 09:34:57 PM »

Also ... I am farming out installations of these sites. If I add a MySQL step, my farm costs go up a bit. I also try to keep it simple.

My workers run a requirements-check script; if all is OK, they FTP the files, chmod them, and add the info to the master list, and then the master server takes care of everything from there on.
nutballs
Administrator, Posts: 5627
"Back in my day we had 9 planets"
« Reply #7 on: July 03, 2008, 09:19:18 AM »

ah yes. perk and i were discussing this a few weeks ago, offline.

generally, a boatload of distributed sites all phoning home to your DB is not going to go well. I know from experience.

Assuming you have write capabilities in your directory structure, you have 2 options if you ignore the DB method.

First, flat file is the most obvious. But it also means that every single site carries a file that is like 600k (or whatever the spider list is at now), and that is a big scan to run every time you need to check an IP.

The other option is to use the file system to your advantage...
An IP is convenient because it gives you an interesting and obvious division of sets...
aaa.bbb.ccc.ddd
So... you can store the IPs as directories. no files.
so lets say you have these IPs
45.23.123.1
45.23.123.2
45.23.200.22
45.64.12.123
123.45.63.6
store them in the directory structure as
45
   23
      123
         1
         2
      200
         22
   64
      12
         123
123
   45
      63
         6


now to test an IP in PHP you would just do
if (is_dir("/45/23/200/22")) { /* That's a spider yo! */ }
or whatever the directory syntax would be; my kid was up all last night so I might be a bit slow.
phone home for updates on whatever schedule you want. delete all the IP Dirs, and rebuild. or whatever method you think would be less wasteful. obviously this would work for any language as well.

the advantage is that it's completely portable: it uses the filesystem to check via a very efficient is_dir(), which just checks whether that path exists (no walking or scanning that I know of).
the disadvantage is that it uses an assload of inodes (that's what they're called, right?), which some hosts will potentially freak out about when they see it. But it is pretty efficient in that you reuse octets as needed.
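A rough PHP sketch of the directory trick — the paths and one-IP-per-line input format are assumptions, and in practice you'd root the tree in a writable directory of your own rather than /:

```php
<?php
// Build the IP directory tree: one nested directory per octet.
// $base is wherever you have write access; one IP per line in $listfile.
function build_ip_tree($listfile, $base) {
    foreach (file($listfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $ip) {
        $path = $base . '/' . str_replace('.', '/', trim($ip));
        if (!is_dir($path)) {
            mkdir($path, 0755, true); // recursive mkdir reuses shared octet dirs
        }
    }
}

// Test an IP: the visitor is a spider iff its path exists.
function ip_in_tree($ip, $base) {
    return is_dir($base . '/' . str_replace('.', '/', $ip));
}
```

Shared prefixes (45/23/... in the example above) are created once and reused, which is the octet reuse mentioned; the cost is one inode per distinct octet path.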

I could eat a bowl of Alphabet Soup and shit a better argument than that.

perkiset
Olde World Hacker
Administrator, Posts: 10096
« Reply #8 on: July 03, 2008, 10:50:34 AM »

DB: I am currently moving all of my stuff to some rather sophisticated MySQL stuff - essentially a single stored function that looks up the surfer and tracks it (regardless of surfer, spider, etc.) all in a single shot, so load is actually decreasing dramatically as I throw fewer requests at the server. But if you can't do such a thing, then may I suggest an Apache method which, to this day, is still one of the fastest methods I've ever created and will probably not show on your host as load against you at all, because the load shows as the user daemon or nobody or however the host has Apache set up. F'reals.

The only caveat is that I know RewriteMap can be used in an htaccess, but I have not done it personally (all my stuff is in Includes hanging off the httpd.conf).

The method is to convert the spiderSpy DB into an SDBM file and then access it via a mod_rewrite lookup table in the httpd.conf or .htaccess. First, I use a Perl script to convert the inbound spiderSpy text into the database - here's the code (note: I don't even remember if I modify the text on the way in, so this code may not work as written for you). Apache caches the DB, so this will really rock after it loads the file the first time.

Code:
#!/usr/bin/perl

use SDBM_File;
use Fcntl;

$botbasetxt_filename = '/www/resource/db/botbase.txt';
$botbasemap_filename = '/www/resource/db/bbase.map';
open (FILE, "$botbasetxt_filename");
        @list = (<FILE>);
close (FILE);

tie (%botbasemaps, SDBM_File, $botbasemap_filename, O_RDWR|O_CREAT, 0644);

foreach $line (@list) {
        ($ip, $se) = split (/\s/, $line);
        $botbasemaps{$ip} = $se;
}

untie %botbasemaps;
exit;

... before I make use of the DB files I rename them from bbase.* to botbase.* ...

... then I access the table and make the decision about what to do with the following chunk of httpd.conf:
Code:
        Options                 +FollowSymLinks
        RewriteMap              botbase dbm:/www/resource/db/botbase.map

        # if it's a spider, reroute the request into the spidersite php system...
        # Note that regardless what input parameters come in, I rewrite them to
        # nothing but the searching engine. Spider URLS do not contain parameters.
        RewriteCond             ${botbase:%{REMOTE_ADDR}|,} >,
        RewriteRule             ^(.*)$          /php/spidersite/main.php$1?engine=${botbase:%{REMOTE_ADDR}}      [L]

the remainder of my VirtualHost chunk rewrites the URL into my normal system (all my pages are dynamically generated from a single script/process) - so if it's a spider, it goes the route listed above; if not, it falls into the normal route.

Hope that spins some gears,
/perk
« Last Edit: July 03, 2008, 10:54:19 AM by perkiset »

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.

dbrown
Rookie, Posts: 28
« Reply #9 on: July 03, 2008, 12:43:10 PM »

Wow.. Thanks for the great replies. The gears spin.

I was doing some more homework last night and I stumbled on some of the PHP APC posts. Wow, I have a new toy. I never knew this functionality existed. I have eAccelerator installed on my dedis, but the user cache in APC tickles my balls..

The directory trick is genius. I like it. Other than APC obviously not being installed on all, if any, shared hosting accounts, were there any other reasons why you're not taking that route?

FWIW..
Installed APC on my dev server: P4 2.8 with 4G of RAM
Loaded all 20K+ IPs into an array
Stored it in APC
Then fetched them all back in right at 1 second. Damn
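For reference, the APC user-cache version of that lookup might look like the sketch below — the key name and one-hour TTL are arbitrary choices, and the apc_* calls require the APC extension:

```php
<?php
// Build an ip => true set from the spider list (one IP per line).
function build_ip_set($filename) {
    $ips = array();
    foreach (file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $ips[trim($line)] = true;
    }
    return $ips;
}

// Park the set in APC's shared-memory user cache so every request on the
// box can grab it without re-reading the file. Requires the APC extension.
function cache_spider_list($filename) {
    apc_store('spider_ips', build_ip_set($filename), 3600); // refresh hourly
}

function apc_is_spider($visitor_ip) {
    $ips = apc_fetch('spider_ips');
    return is_array($ips) && isset($ips[$visitor_ip]);
}
```

Note that apc_fetch() copies the whole array out of shared memory on every call — which is exactly the per-request copying cost perkiset warns about in his reply.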
nutballs
Administrator, Posts: 5627
« Reply #10 on: July 03, 2008, 12:54:21 PM »

my methodology is very compact: no database, remote nodes that operate independently and phone home when told to. I can run it on virtually any PHP5 host and it is very lightweight. The only requirements I have are PHP5 and XML. That's it. So for me the directory method rocks. Currently my nodes don't use it though, because I am trying a different approach which doesn't care about IPs.
perkiset
Administrator, Posts: 10096
« Reply #11 on: July 03, 2008, 01:01:36 PM »

Quote from: dbrown
"The directory trick is genius. I like it. Other than APC obviously not being installed on all, if any, shared hosting accounts, were there any other reasons why you're not taking that route?"

This method NB and I worked through for quite a while as a workaround for really, really bad hosts that don't have many of the contemporary tools. Why someone would ever pick a host like that is beyond me  Roll Eyes  Mobster But I digress. This method is quick enough, but today it will take up close to 30K inodes on a box... so the problem is if a host won't let you go that route. An alternative we came up with was to do the first 3 octets as directories and then the last octet in a file that you load, or better, PHP-include (this is the fastest), so you keep the total number of inodes under what your host will allow - this is particularly important if you have a boatload of pages on a spam site, for example.

The fastest way to do this is to convert the node addresses (the last octet) and spider names into an array, then store it as a serialized array in a file. The "include" PHP then looks something like this (imagine you name the file "spiders.txt"):

<?php
$spiderArr = unserialize('{... the serialized array ...}');
?>

...that's it. Then your master file simply needs to do something like this:
$spiderArr = array();
preg_match('/^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/', $_SERVER['REMOTE_ADDR'], $parts);
$dir = str_replace('.', '/', $parts[0]);
include "$dir/spiders.txt";

At this point, $spiderArr either contains nothing (what you originally set it to) or it has been overwritten by the included file. Since it's "include", not "require", PHP will not stop if the file is missing (it only emits a warning, which most error-reporting setups hide) - and then you can check the array to see if the current REMOTE_ADDR is a spider or not.
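The writer side of that scheme might look like the sketch below. It's a guess at the generation step under assumptions: the input is whitespace-separated "ip enginename" lines (the format perk's Perl script above splits on), and it uses var_export() instead of a serialized string — which side-steps quote-escaping inside the generated file and loads the same way via include.

```php
<?php
// Group spider IPs by their first three octets, then emit one spiders.txt
// per /a/b/c directory containing a $spiderArr map of last-octet => engine.
function write_spider_files($listfile, $base) {
    $groups = array();
    foreach (file($listfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        list($ip, $se) = preg_split('/\s+/', trim($line), 2);
        $dot    = strrpos($ip, '.');
        $prefix = substr($ip, 0, $dot);   // "a.b.c"
        $last   = substr($ip, $dot + 1);  // "d"
        $groups[$prefix][$last] = $se;
    }
    foreach ($groups as $prefix => $map) {
        $dir = $base . '/' . str_replace('.', '/', $prefix);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);      // recursive: creates a/b/c in one go
        }
        // var_export() writes a plain PHP array literal the include evaluates.
        file_put_contents($dir . '/spiders.txt',
            "<?php\n\$spiderArr = " . var_export($map, true) . ";\n");
    }
}
```

Rebuilding the whole tree on whatever update schedule you phone home with keeps the reader side exactly as perkiset shows it: set $spiderArr to an empty array, include, check.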

If you have the RAM, you can do this via APC, although there are many methods and arguments about hashing speed, too many elements in the array, running out of cache RAM and having to shift to disc all the time... it's a pretty big topic. I toyed with it for a while but found that I didn't like any of my solutions, so I passed on APCing the spiderSpy DB. Again, I am now doing almost all my spider IDing, tracking and cloak setup in a single stored procedure, so I'm quite a ways away from this ATM.

Quote from: dbrown
"Installed APC on my dev server: P4 2.8 with 4G of RAM. Loaded all 20K+ IPs into an array, stored it in APC, then fetched them all back in right at 1 second. Damn"

The way APC works is that it keeps a copy of <whatever you store> in RAM - this sounds great, but consider this: if you load a 29K-entry spider IP table into an array, then use that array for every page call, you're making another copy of that array for every instance and every page called, every time - this is ENORMOUSLY RAM- and processor-intensive - it is a bad way to go. Just sayin'.

/p
perkiset
Administrator, Posts: 10096
« Reply #12 on: July 03, 2008, 01:02:42 PM »

Quote from: nutballs
"my methodology is very compact, no database, remote nodes that operate independently and phone home when told to. I can run it on virtually any php5 host and it is very lightweight. The only requirement I have is php5 and XML. thats it. so for me the directory method rocks. currently my nodes dont use it though, because i am trying a different approach which doesnt care about IPs."
Sounds like you've completely buttoned up what we were discussing my friend... well done Wink
nutballs
Administrator, Posts: 5627
« Reply #13 on: July 03, 2008, 01:21:07 PM »

If I get a chance, I will polish it and post it. Not sure when that will happen, mind you, but I'll try to get to it.
nop_90
Global Moderator, Posts: 2203
« Reply #14 on: July 03, 2008, 03:11:05 PM »

Just use Berkeley DB.
It will be much faster than any of the alternatives mentioned.
And it will be the simplest and most portable.

If you are very concerned about speed, there are faster alternatives to Berkeley DB.
But you will have to make your own C binding.
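For the common case you don't even need a C binding from PHP: the dba extension wraps Berkeley DB (among other handlers) directly, when it's compiled in. A sketch under assumptions — the 'db4' handler name depends on how dba was built; dba_handlers() tells you what you actually have:

```php
<?php
// Write the spider list into a keyed DB file once per update cycle...
function build_spider_db($listfile, $dbfile, $handler = 'db4') {
    $db = dba_open($dbfile, 'c', $handler); // 'c' = create / read-write
    foreach (file($listfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        dba_replace(trim($line), '1', $db); // key = IP, value is just a marker
    }
    dba_close($db);
}

// ...then each request does a single keyed lookup, no full scan.
function db_is_spider($dbfile, $visitor_ip, $handler = 'db4') {
    $db = dba_open($dbfile, 'r', $handler); // 'r' = read-only
    $hit = dba_exists($visitor_ip, $db);
    dba_close($db);
    return $hit;
}
```

This is essentially the PHP-side twin of perkiset's SDBM/RewriteMap approach, minus Apache: one keyed read per request against an on-disk hash.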