DangerMouse

Hey all,

I've been puzzling over this for a while and haven't come to any good conclusions yet: how would you go about structuring a DB to track SERPs over time?

Obviously it's possible to simply record results against a date, but that strikes me as really inefficient: you could end up storing data that hasn't changed over and over. Equally, though, I'm not sure how I'd devise some kind of transactional logic that compares against the existing record for a unique item (URL) and only updates when appropriate - the title tag could change, the snippet could change, not to mention the ranking position.

I try to normalise my databases as much as possible, but the more I think about this, the less I can figure out the correct tables and table types I'd need. I'm sure this is a common relational DB problem, even if it's usually applied to things other than SERPs. Any tips?

DM

nutballs

You're only talking about URLs and ranks, so the only normalizing you can really do is a table of URLs and URL IDs, and a rank table of URL IDs and ranks.
The problem is that as the database gets bigger, the checks to see if a URL already exists will get kinda huge, assuming you're tracking many keywords and many URLs. But maybe not; I'm often surprised at how little data sizes change once you seed things fully. My gut says it'll be an issue, though.

Another option is a table of:
URLID, keywordID, date, rank, URL
with keywords in another table, and you only add a new entry if the rank changes.
So if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest ranks 1, 2, 3, 4, 5, 6 for a particular keyword without much trouble. However, this would still result in some hefty bloat.
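Something like this is what I mean by only adding an entry when the rank changes - a rough sketch only, in Python with sqlite3 standing in for whatever DB you actually use, and the table/column names are just made up for illustration:

import sqlite3

# Throwaway table for the sketch: one row per rank *change* for a URL/keyword pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ranks (
        url_id     INTEGER,
        keyword_id INTEGER,
        rank       INTEGER,
        checked_on TEXT
    )
""")

def record_rank(conn, url_id, keyword_id, rank, checked_on):
    # Look up the most recently stored rank for this URL/keyword pair.
    row = conn.execute(
        "SELECT rank FROM ranks WHERE url_id = ? AND keyword_id = ? "
        "ORDER BY checked_on DESC LIMIT 1",
        (url_id, keyword_id),
    ).fetchone()
    # Only write a new row if there's no history yet or the rank actually moved.
    if row is None or row[0] != rank:
        conn.execute(
            "INSERT INTO ranks (url_id, keyword_id, rank, checked_on) "
            "VALUES (?, ?, ?, ?)",
            (url_id, keyword_id, rank, checked_on),
        )
        conn.commit()

record_rank(conn, 1, 1, 1, "2008-04-08")   # google.com at #1: stored
record_rank(conn, 1, 1, 1, "2008-04-15")   # still #1: skipped
record_rank(conn, 1, 1, 3, "2008-04-22")   # dropped to #3: stored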

DangerMouse

Quote from: nutballs
So if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest ranks 1, 2, 3, 4, 5, 6 for a particular keyword without much trouble. However, this would still result in some hefty bloat.


Definitely, this is the issue I've been puzzling over - and it's actually just got a little more complicated, because I hadn't even got as far as considering multiple keywords yet!

So far, in my head, I'd set the tables up like this, but I'm just not sure how to store the relationship with the date correctly.

tbl_Urls
  ID
  URL
  fk_Updates

tbl_Meta
  ID
  fk_Url
  Title
  Snippet
  fk_Updates

tbl_Updates
  ID
  Update_DateTime

I'm thinking that if it's structured like this, I can join tbl_Urls to tbl_Updates to get a specific timeframe, then tag on the metadata, hopefully with a condition that takes the maximum tbl_Meta.fk_Updates that is less than or equal to the currently selected 'update' time.
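In query terms I guess I mean something roughly like this - just a sketch (Python with sqlite3 so it actually runs), and whether the correlated subquery is the right condition is exactly the bit I'm unsure about:

import sqlite3

# Empty copies of the tables above, purely so the query has something to run against.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_Updates (ID INTEGER PRIMARY KEY, Update_DateTime TEXT);
    CREATE TABLE tbl_Urls    (ID INTEGER PRIMARY KEY, URL TEXT);
    CREATE TABLE tbl_Meta    (ID INTEGER PRIMARY KEY, fk_Url INTEGER,
                              Title TEXT, Snippet TEXT, fk_Updates INTEGER);
""")

def meta_as_of(conn, update_id):
    # For each URL, pick the meta row with the highest fk_Updates that is still
    # less than or equal to the update we're interested in.
    return conn.execute("""
        SELECT u.URL, m.Title, m.Snippet
        FROM tbl_Urls u
        JOIN tbl_Meta m ON m.fk_Url = u.ID
        WHERE m.fk_Updates = (
            SELECT MAX(m2.fk_Updates)
            FROM tbl_Meta m2
            WHERE m2.fk_Url = u.ID AND m2.fk_Updates <= ?
        )
    """, (update_id,)).fetchall()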

Just read that back, and I think I've even managed to confuse myself, so I'm not sure anyone else has a hope of deciphering my ramblings.

DM

perkiset

I'd do something more akin to:

tbl_Keywords
===================
id int pk autoinc
phrase vc255 unique

tbl_URLs
===================
id int pk autoinc
url vc255 unique

tbl_Engines
===================
id int pk autoinc
name vc64 indexed

tbl_SERPs
===================
keyword_id int indexed
url_id int indexed
engine_id int indexed
(those three comprise my PK)
rank int
checkedon date

Rows in the SERPs table will be very small, comprising only four integers and a date field. You can search quickly by keyword, URL, engine, etc. Run a small cron job to clear out records in this table older than, say, 180 days and you'll never have trouble.
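Written out as actual DDL it would look roughly like this. The sketch below is sqlite-flavoured (via Python) purely so it runs as-is; on MySQL you'd use the usual INT AUTO_INCREMENT, VARCHAR(255), DATE and so on, but the shape is the same:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_Keywords (
        id     INTEGER PRIMARY KEY AUTOINCREMENT,
        phrase VARCHAR(255) UNIQUE
    );

    CREATE TABLE tbl_URLs (
        id  INTEGER PRIMARY KEY AUTOINCREMENT,
        url VARCHAR(255) UNIQUE
    );

    CREATE TABLE tbl_Engines (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(64)
    );
    CREATE INDEX idx_engine_name ON tbl_Engines (name);

    -- Note: with keyword/url/engine as the whole PK you keep exactly one row per
    -- combination; if you want multiple dated snapshots per combination, checkedon
    -- would need to join the key (my reading, not gospel).
    CREATE TABLE tbl_SERPs (
        keyword_id INTEGER,
        url_id     INTEGER,
        engine_id  INTEGER,
        rank       INTEGER,
        checkedon  DATE,
        PRIMARY KEY (keyword_id, url_id, engine_id)
    );
    CREATE INDEX idx_serps_url    ON tbl_SERPs (url_id);
    CREATE INDEX idx_serps_engine ON tbl_SERPs (engine_id);
""")

The 180-day cleanup is then just a one-line DELETE against tbl_SERPs on checkedon, run from cron.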

Don't underestimate the value of seeing something NOT change over and over and over... that is equally good information, and the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive space is cheap - your time is expensive.

Obviously, a base like this can be used to expand the DB in the direction it looks like you are going...

DangerMouse

Quote from: perkiset
...the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive space is cheap - your time is expensive.


^^ A great tip - I think I get caught up in the whole normalisation thing for no real reason.

Just to confirm: using an implementation like you described above, Perk, will performance take a huge hit as the DB grows? I'm thinking along the lines of 100 keywords, 100 results, 2 engines, once per week for 6 months = approx 100 x 100 x 2 x 24 = 480k rows - and it wouldn't be difficult to see this grow even further.

DM

perkiset

Not at all - that's a pretty tiny set, all in all. My current SERP tables are regularly in the 8-10 million row zone, as are my surfer tracking and telemetry records. Correct indexing is ABSOLUTELY necessary, but the sizes you're talking about are really fine.

Bear in mind that there's really only one table growing - the SERP table, which is tiny, light, and fixed-length. It'll be way fast.

When you're coding it, I'd preload all the URLs, engines, and phrases into arrays so that you can create update records without ever having to hit the DB for ID values.
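Something along these lines is what I mean by the preloading - a rough Python sketch, with sqlite3 standing in for the real DB, and it assumes the tables from the DDL sketch above already exist:

import sqlite3

def load_lookup(conn, table, key_col):
    # e.g. load_lookup(conn, "tbl_URLs", "url") -> {"http://google.com/": 1, ...}
    return {row[0]: row[1]
            for row in conn.execute(f"SELECT {key_col}, id FROM {table}")}

def get_or_create(conn, table, key_col, value, cache):
    # Hit the in-memory cache first; only touch the DB when a brand new value shows up.
    if value not in cache:
        cur = conn.execute(f"INSERT INTO {table} ({key_col}) VALUES (?)", (value,))
        cache[value] = cur.lastrowid
    return cache[value]

def store_serp_row(conn, caches, keyword, url, engine, rank, checkedon):
    # Resolve all three IDs from the caches, then write the one small SERP row.
    kid = get_or_create(conn, "tbl_Keywords", "phrase", keyword, caches["kw"])
    uid = get_or_create(conn, "tbl_URLs", "url", url, caches["url"])
    eid = get_or_create(conn, "tbl_Engines", "name", engine, caches["eng"])
    conn.execute(
        "INSERT INTO tbl_SERPs (keyword_id, url_id, engine_id, rank, checkedon) "
        "VALUES (?, ?, ?, ?, ?)",
        (kid, uid, eid, rank, checkedon),
    )

def run_check(conn):
    # Load the three small lookup tables once, then insert away without ID lookups.
    caches = {
        "kw":  load_lookup(conn, "tbl_Keywords", "phrase"),
        "url": load_lookup(conn, "tbl_URLs", "url"),
        "eng": load_lookup(conn, "tbl_Engines", "name"),
    }
    store_serp_row(conn, caches, "search engines", "http://google.com/", "google", 1, "2008-04-08")
    conn.commit()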

nutballs

480k? Pfft, no problem. Though on a shared host you might get some complaints if your queries aren't tuned.

Perk's schema looks solid, though the varchar lengths might need some dicking with - the URL field, for example.

perkiset

Quote from: nutballs
...though the varchar lengths might need some dicking with - the URL field, for example.


Isn't a URL limited to 255 chars? Thought it was...

Anyhoo, if you choose larger you'll need to go to a TEXT-type field, which changes your indexing options a bit.

nutballs

Nope - a domain is max 100 chars plus the extension, but a URL with params can be miles long.

Varchar will still work if you're using MySQL 5.0.3 or higher - up to varchar(65535).
Though that's also the row size limit, so I don't recommend going that big; a few thousand can still be used, though, if the rest of the row is just small data chunks like INTs.
For URL lengths I stop at 1k, since it's just silly after that (though 1k is silly as is...).
This still lets you keep the indexing advantage of varchar.
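One way to keep a longer url column and still get an index on it is a prefix index - nobody's mentioned that above, it's just an illustration (MySQL flavour, lengths picked out of the air):

# MySQL-flavour DDL kept as a plain Python string, since there's nothing local to run it against.
MYSQL_URL_TABLE = """
CREATE TABLE tbl_URLs (
    id  INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(1000) NOT NULL,      -- room for long URLs with params
    KEY idx_url (url(255))           -- index only the first 255 chars of the URL
);
"""

if __name__ == "__main__":
    print(MYSQL_URL_TABLE)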


Oh wait - you're thinking of the IE limit. IE7 removed that, I think; I'm pretty sure it's now at least 1024.

perkiset

Gotcha... thanks mang.

And about the varchar length as well... I'm gonna have to go look at that - that puts it into TEXT-field-size territory, and I'm surprised the index handler will manage that... thanks for the tip!

nutballs

Actually, don't quote me on the indexing - something in the back of my head is telling me that the index key limit is 1024.

