The Cache: Technology Expert's Forum
 
Author Topic: IP Address Caching, IDing Spiders  (Read 3896 times)
cdc
« on: April 23, 2007, 12:02:43 PM »

This thread is a split from Database Optimization

So it sounds like the consensus is that storing IP addresses as INTs may be slightly more efficient, but not enough to go and change your code around for. Am I right?

Perk, you mentioned that you're moving your IP list into cache (I'm guessing shared memory?) to do your lookups. Care to elaborate on how this is done?

In case you can't tell, I was trained on "don't worry about efficiency, worry about code completeness and readability," but unfortunately that doesn't scale. :rolleyes:
« Last Edit: April 23, 2007, 03:08:32 PM by perkiset »

Will code for food.
perkiset
« Reply #1 on: April 23, 2007, 12:22:27 PM »

Currently my entire spider DB is loaded into memcached as individual items. I have a boatload of stuff that each site needs to know when it starts up, so I query memcached for all of it at once and it's pretty fast.

The challenge I face with APC is that you really have two choices. Either store and retrieve the entire array as a single cache entry (the key is the IP address, the value is the name of the spider), paying for that much memory movement plus the hash lookup into the array on every fetch, or put a unique entry in the cache for each item, which is considerably more efficient from a lookup standpoint: all you're doing is the hashing and returning a few bytes for the spider name.

But I am also concerned with the size/count of items in the cache - so I've been playing with something like this:
spiderDB[octet1][octet2 & octet3 & octet4]
spiderDB[octet1][octet2][octet3 & octet4]

I'm trying to find the sweet spot between the number of cache entries and bulk memory movement. Remember that you cannot use something like $myArray = &apc_fetch('anarray'), because the function will not return a reference, so you'll need to load the entire value into your local heap before you can use it.
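A minimal sketch of the first layout above (one bucket per first octet), using plain PHP arrays. The sample IPs and the function names bucketByFirstOctet() and lookupSpider() are made up for illustration; in real use each bucket would presumably be stored as its own cache entry rather than kept in one local array.

```php
<?php
// Hypothetical sample data: dotted-quad IP => spider name.
$spiders = [
    '66.249.66.1'  => 'Google',
    '66.249.70.12' => 'Google',
    '72.30.1.5'    => 'Yahoo',
];

// Group entries by first octet, so each bucket could become one cache entry.
function bucketByFirstOctet(array $spiders): array
{
    $buckets = [];
    foreach ($spiders as $ip => $name) {
        // Split "66.249.66.1" into "66" and "249.66.1".
        [$first, $rest] = explode('.', $ip, 2);
        $buckets[$first][$rest] = $name;
    }
    return $buckets;
}

// Look up an IP: pick the bucket by first octet, then the remainder inside it.
function lookupSpider(array $buckets, string $ip): ?string
{
    [$first, $rest] = explode('.', $ip, 2);
    return $buckets[$first][$rest] ?? null;
}

$buckets = bucketByFirstOctet($spiders);
```

With this split there are at most 256 buckets, and a lookup only has to pull in the one bucket matching the visitor's first octet. Splitting on the first two octets instead gives more, smaller buckets.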

Taking string evaluations out of it, if I store the entire spider DB as $spiderDB[$intValueOfIP] = 'Google' then I think the hashing might go way quicker. The APC docs claim no degradation in performance even into the 100s of thousands of entries, but I am inherently skeptical.

But I'm gonna give it a try because I like this integer idea...
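A rough sketch of the integer-key idea, assuming IPv4 addresses and PHP's built-in ip2long(). The helper names ipToKey() and identifySpider() and the sample entries are hypothetical, and a plain array stands in for whatever ends up in the cache.

```php
<?php
// Convert a dotted-quad IPv4 address to an integer array key.
// Returns null when the string is not a valid IPv4 address.
function ipToKey(string $ip): ?int
{
    $long = ip2long($ip);
    return $long === false ? null : $long;
}

// Build the spider DB keyed by integer instead of by string.
$spiderDB = [];
foreach (['66.249.66.1' => 'Google', '72.30.1.5' => 'Yahoo'] as $ip => $name) {
    $spiderDB[ipToKey($ip)] = $name;
}

// Identify a visitor by IP; null means "not a known spider".
function identifySpider(array $spiderDB, string $ip): ?string
{
    $key = ipToKey($ip);
    return $key === null ? null : ($spiderDB[$key] ?? null);
}
```

Note that on 32-bit PHP builds ip2long() can return negative numbers for high addresses. They still work as unique array keys, but they will not match an unsigned INT column without conversion.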

/p

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
cdc
« Reply #2 on: April 23, 2007, 02:51:45 PM »

We're getting a little off topic now, but I'm interested that you care about which spider is visiting your site. I don't even store the spider's name locally. I'm wondering (but not expecting you to answer) why you care if a bot is from Yahoo, Google, or MSN...
perkiset
« Reply #3 on: April 23, 2007, 03:13:58 PM »

I like knowing which spiders have visited which pages. I can see the frequency of visit, the frequency of return and the depth of crawl all the time. I keep the last 5 visits by every spider to every page in a rolling FIFO table.
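For illustration only, here is how a rolling last-5 FIFO per page and spider could look in memory. The post describes a table, so the nested array, the recordVisit() name and the MAX_VISITS constant are all assumptions standing in for the real storage.

```php
<?php
// Keep at most this many visits per (page, spider) pair.
const MAX_VISITS = 5;

// Append a visit timestamp and drop the oldest once the limit is exceeded.
function recordVisit(array &$log, string $page, string $spider, int $timestamp): void
{
    $log[$page][$spider][] = $timestamp;
    if (count($log[$page][$spider]) > MAX_VISITS) {
        array_shift($log[$page][$spider]); // first in, first out
    }
}

$log = [];
recordVisit($log, '/index.php', 'Google', time());
```

The timestamps kept per pair are enough to work out visit frequency and return frequency for that spider on that page.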

This is not specific to spiders: I also use GeoIP to send out content specific to various countries / regions, specifically pricing. So the inbound IP is pretty durn important to me.

But to return to your question, there are times that I might want G to see something that Y does not and vice versa. Better throw that topic back @ the syndk8 ;)

/p