cdc

This thread is a split from Database Optimization

So it sounds like the consensus is that storing IP addresses as INTs may be slightly more efficient, but not enough to justify going back and changing your code. Am I right?

Perk, you mentioned that you're moving your IP list into cache (I'm guessing shared mem?) to do your lookups. Care to elaborate on how this is done?

In case you can't tell, I was trained in the "don't worry about efficiency, worry about code completeness and readability" school, but unfortunately that doesn't scale.

perkiset

Currently my entire spider DB is loaded into memcached as individual items. I have a boatload of stuff that each site needs to know when it starts up, so I query memcached for all of it at once, and it's pretty fast.

The challenge I face with APC is that you really have two choices. Either you store and retrieve the entire array as a single cache entry, paying for that much memory movement and then for hashing into the array (the key is the IP address, the value is the name of the spider). Or you put a unique entry in the cache for each item, which will be considerably more efficient from a lookup standpoint: all you're doing is hashing the key and returning a few bytes for the spider name.

But I am also concerned with the size/count of items in the cache - so I've been playing with something like this:
spiderDB[octet1][octet2 & octet3 & octet4]
spiderDB[octet1][octet2][octet3 & octet4]

trying to find the sweet spot between the number of cache entries and bulk memory movement. Remember that you cannot use something like $myArray = &apc_fetch('anarray'), because the function will not return a reference, so you'll need to load the entire value into your local heap before you can use it.
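To make the bucketing idea concrete, here is a rough PHP sketch, not anyone's production code. The function names (bucketKeys, lookupSpider), the 'spider:' key prefix and the sample entries are all made up for illustration, and a plain array stands in for the cache so the snippet runs without APC. The depth-2 variant here flattens the first two octets into one cache key, which is one possible reading of the second layout above.

```php
<?php
// Split an IPv4 address into a cache key (the bucket) and an inner array key.
// $depth = 1 -> bucket "octet1",        inner "octet2.octet3.octet4"
// $depth = 2 -> bucket "octet1.octet2", inner "octet3.octet4"
function bucketKeys(string $ip, int $depth): array
{
    $octets = explode('.', $ip);
    if (count($octets) !== 4 || $depth < 1 || $depth > 3) {
        throw new InvalidArgumentException("bad IP or depth: $ip / $depth");
    }
    $bucket = 'spider:' . implode('.', array_slice($octets, 0, $depth));
    $inner  = implode('.', array_slice($octets, $depth));
    return [$bucket, $inner];
}

// $cache is a plain array standing in for the shared cache (an assumption);
// with APC, the read of $cache[$bucket] would be a cache fetch instead.
function lookupSpider(array $cache, string $ip, int $depth): ?string
{
    [$bucket, $inner] = bucketKeys($ip, $depth);
    // Only this one bucket has to be copied out of the cache.
    $entries = $cache[$bucket] ?? [];
    return $entries[$inner] ?? null;
}

// Sample data only: documentation-range IPs with arbitrary spider names.
$cache = [
    'spider:192' => ['0.2.10' => 'Google'],
    'spider:198' => ['51.100.7' => 'Yahoo'],
];
```

A deeper bucket means more cache entries but less data copied per lookup; a shallower one means the reverse, which is exactly the trade-off being tuned.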

Taking string evaluations out of it, if I store the entire spider DB as $spiderDB[$intValueOfIP] = 'Google' then I think the hashing might go way quicker. The APC docs claim no degradation in performance even into the 100s of thousands of entries, but I am inherently skeptical.
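For anyone following along, the integer-keyed version might look something like this. It is only a sketch: ipToKey and spiderFor are hypothetical helper names, the entries are sample data on documentation-range IPs, and whether it is actually faster is untested here.

```php
<?php
// Convert a dotted-quad IPv4 string to an integer array key using ip2long().
// Note: on 32-bit PHP builds ip2long() can return a negative value, which
// still works as an array key.
function ipToKey(string $ip): int
{
    $long = ip2long($ip);
    if ($long === false) {
        throw new InvalidArgumentException("not a valid IPv4 address: $ip");
    }
    return $long;
}

// Look up the spider name for an IP, or null if the IP is not in the list.
function spiderFor(array $spiderDB, string $ip): ?string
{
    return $spiderDB[ipToKey($ip)] ?? null;
}

// Sample data only.
$spiderDB = [
    ipToKey('192.0.2.10')   => 'Google',
    ipToKey('198.51.100.7') => 'Yahoo',
];
```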

But I'm gonna give it a try because I like this integer idea...

/p

cdc

We're getting a little off topic now, but I'm interested that you care about which spider is visiting your site. I don't even store the spider's name locally. I'm wondering (but not expecting you to answer) why you care if a bot is from yahoo, google, or msn...

perkiset

I like knowing which spiders have visited which pages. I can see the frequency of visit, the frequency of return and the depth of crawl all the time. I keep the last 5 visits by every spider to every page in a rolling FIFO table.
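The rolling "last 5 visits" idea can be sketched in a few lines of PHP. This is an in-memory illustration only, not the actual table described above (which presumably lives in a database); recordVisit and MAX_VISITS are made-up names.

```php
<?php
const MAX_VISITS = 5;

// Append a visit timestamp for ($page, $spider) and drop the oldest entry
// once more than MAX_VISITS are stored. Returns the updated log.
function recordVisit(array $log, string $page, string $spider, int $timestamp): array
{
    $log[$page][$spider][] = $timestamp;
    if (count($log[$page][$spider]) > MAX_VISITS) {
        array_shift($log[$page][$spider]); // first in, first out
    }
    return $log;
}
```

Each call would be made once per spider hit; the per-page, per-spider list never grows past five entries.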

This is not specific to spiders: I also use GeoIP to send out content specific to various countries / regions, specifically pricing. So the inbound IP is pretty durn important to me.

But to return to your question, there are times that I might want G to see something that Y does not, and vice versa. Better throw that topic back @ the syndk8.

/p

