The Cache: Technology Expert's Forum
 
Author Topic: Rank tracking over time - most effective DB structure?  (Read 7779 times)
DangerMouse
« on: April 08, 2008, 09:55:30 AM »

Hey all,

I've been puzzling over this for a while and haven't come to any good conclusions yet - how would you go about structuring a DB to track SERPs over time?

Obviously it's possible to simply record results against a date, but this strikes me as really inefficient - you could end up storing data that hasn't changed over and over. Equally, however, I'm not sure how I'd devise some kind of transactional logic that compares the existing record for a unique item (URL) and only updates it when appropriate, i.e. the title tag could change, the snippet could change, not to mention the ranking position.

I try to normalise my databases as much as possible, but the more I think about this the more I can't quite figure out the correct tables and table types I'd need. I'm sure this is a common relational DB problem, even if it's applied to things other than SERPs. Any tips?

DM
nutballs
« Reply #1 on: April 08, 2008, 10:18:27 AM »

You're only talking about URLs and ranks, so the only normalizing you can really do is a table of URLs and URL IDs, and a rank table of URL IDs and ranks.
The problem is that as the database gets bigger, the checks to see if the URL already exists will get kinda huge, assuming many keywords being tracked and many URLs. But maybe not - I'm often surprised at how little the data size of things changes once you seed it fully. My gut says yes, though.

Another option is a table of:
URLID, keywordID, date, rank, URL
with keywords in another table.
Only add a new entry if the rank changes.
So if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest rank 1, 2, 3, 4, 5, 6 for a particular keyword without much trouble. However, this would also result in some hefty bloat.
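Pulling the "current" top spots back out is then something along these lines - just a sketch, untested, and the table name (tbl_Ranks) and the literal keyword ID are made up:

-- latest stored rank per URL for one keyword, given rows are only
-- written when the rank changes (tbl_Ranks = the table sketched above)
SELECT r.URL, r.`rank`
FROM tbl_Ranks r
JOIN (
    SELECT URLID, MAX(`date`) AS last_change
    FROM tbl_Ranks
    WHERE keywordID = 42          -- 42 = whichever keyword you're checking
    GROUP BY URLID
) latest ON latest.URLID = r.URLID AND latest.last_change = r.`date`
WHERE r.keywordID = 42
ORDER BY r.`rank`;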
DangerMouse
« Reply #2 on: April 08, 2008, 11:09:51 AM »

Quote from: nutballs on April 08, 2008, 10:18:27 AM
So if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest rank 1, 2, 3, 4, 5, 6 for a particular keyword without much trouble. However, this would also result in some hefty bloat.

Definitely - this is the issue I've been puzzling over, and it's actually just got a little more complicated because I hadn't even got as far as considering multiple keywords yet!

So far in my head I'd set the tables up like this, but I'm just not sure how to store the relationship with the date correctly.

tbl_Urls
ID
URL
fk_Updates

tbl_Meta
ID
fk_Url
Title
Snippet
fk_Updates

tbl_Updates
ID
Update_DateTime

I'm thinking that if it's structured like this I can join tbl_Urls to tbl_Updates to get a specific timeframe, then tag on the metadata, hopefully with a condition that gets the maximum tbl_Meta.fk_Updates that is less than or equal to the currently selected 'update'.
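Roughly this, in SQL terms - untested, and it assumes the tbl_Updates IDs increase with time (123 stands in for the currently selected update ID):

SELECT u.URL, m.Title, m.Snippet
FROM tbl_Urls u
JOIN tbl_Meta m ON m.fk_Url = u.ID
WHERE m.fk_Updates = (
    SELECT MAX(m2.fk_Updates)
    FROM tbl_Meta m2
    WHERE m2.fk_Url = u.ID
      AND m2.fk_Updates <= 123    -- newest metadata at or before the selected update
);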

Just read that back and I think I've even managed to confuse myself, so I'm not sure anyone else has a hope of deciphering my ramblings. D'oh!

DM
perkiset
« Reply #3 on: April 08, 2008, 12:31:57 PM »

I'd do something more akin to:

tbl_Keywords
===================
id int pk autoinc
phrase vc255 unique

tbl_URLs
===================
id int pk autoinc
url vc255 unique

tbl_Engines
===================
id int pk autoinc
name vc64 indexed

tbl_SERPs
===================
keyword_id int indexed
url_id int indexed
engine_id int indexed
(those three comprise my PK)
rank int
checkedon date

The rows in the SERPs table will be very small, comprising only four integers and a date field. You can search quickly by keyword, URL, engine etc. Run a small cron job to clear out records in this table older than, say, 180 days and you'll never have troubles.

Don't underestimate the value of seeing something NOT change over and over and over... that is equally good information, and the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive is cheap - your time is expensive.

Obviously, a base like this can be used to expand the DB in the direction it looks like you are going...
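In raw MySQL that's roughly the following - just a sketch, untested; the NOT NULLs and index names are additions to taste:

CREATE TABLE tbl_Keywords (
    id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    phrase VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE tbl_URLs (
    id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE tbl_Engines (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL,
    KEY idx_name (name)
);

CREATE TABLE tbl_SERPs (
    keyword_id INT NOT NULL,
    url_id     INT NOT NULL,
    engine_id  INT NOT NULL,
    rank       INT NOT NULL,
    checkedon  DATE NOT NULL,
    PRIMARY KEY (keyword_id, url_id, engine_id),  -- as above; add checkedon if each check should keep its own row
    KEY idx_url (url_id),
    KEY idx_engine (engine_id)
);

-- and the cron cleanup is just:
DELETE FROM tbl_SERPs WHERE checkedon < CURDATE() - INTERVAL 180 DAY;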
DangerMouse
« Reply #4 on: April 08, 2008, 12:47:56 PM »

Quote from: perkiset on April 08, 2008, 12:31:57 PM
...the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive is cheap - your time is expensive.

^^ A great tip - I think I get caught up in the whole normalisation thing for no real reason.

Just to confirm: using an implementation like you described above, Perk, will performance take a huge hit as the DB grows? I'm just thinking along the lines of 100 keywords, 100 results, 2 engines, once per week for 6 months = approx 100 x 100 x 2 x 24 = 480k rows - and it wouldn't be difficult to see this grow even further.

DM
perkiset
« Reply #5 on: April 08, 2008, 12:56:09 PM »

Not at all - that's a pretty tiny set, all in all. My current SERP tables are regularly in the 8-10 million row zone, as are my surfer tracking and telemetry records. Correct indexing is ABSOLUTELY necessary, but the sizes you're talking about are really fine.

Bear in mind that there's really only one table growing - the SERPs table, which is tiny, light and fixed-length. It'll be way fast.

When you're coding it, I'd preload all the URLs, engines and phrases into arrays so I can create update records without ever having to hit the DB for ID values.
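In other words, something like this - load the lookup tables once, keep id => value maps in the app, and then the rank inserts never touch anything but tbl_SERPs (sketch only; the literal IDs are placeholders):

-- run once up front and cache the results as id => value arrays
SELECT id, phrase FROM tbl_Keywords;
SELECT id, url    FROM tbl_URLs;
SELECT id, name   FROM tbl_Engines;

-- then each scrape batches its rows straight in, IDs resolved from the cache
INSERT INTO tbl_SERPs (keyword_id, url_id, engine_id, rank, checkedon)
VALUES (1, 17, 2, 1, CURDATE()),
       (1, 23, 2, 2, CURDATE()),
       (1, 56, 2, 3, CURDATE());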
nutballs
« Reply #6 on: April 08, 2008, 03:18:44 PM »

480k? Pfft, no problem - though on a shared host you might get some complaints if your queries aren't tuned.

Perk's schema looks solid, though the varchar lengths might need some dicking with - the URL field, for example.
perkiset
« Reply #7 on: April 08, 2008, 03:41:34 PM »

Quote from: nutballs on April 08, 2008, 03:18:44 PM
...though the varchar lengths might need some dicking with - the URL field, for example.

Isn't a URL locked into 255 chars? Thought it was...

Anyhoo, if you choose larger, you'll need to go to a TEXT-type field, which changes your indexing options a bit.
nutballs
« Reply #8 on: April 08, 2008, 05:13:23 PM »

Nope - a domain is max 100 chars plus the extension, but a URL with params can be miles long.

Varchar will still work if you're using MySQL 5.0.3 or higher: varchar(65535).
Though that's also the row size limit, so I don't recommend it. However, a few thousand can still be used if the rest of the row is just small data chunks like INTs.
For URL lengths I stop at 1k since it's just silly after that (though 1k is silly as is...).
This allows you to still have the indexing advantage of varchar.
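For example - a sketch only; the 1000 is arbitrary, and the index name and prefix length are just placeholders:

CREATE TABLE tbl_URLs (
    id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(1000) NOT NULL,
    KEY idx_url (url(255))   -- prefix index: only the first 255 chars are indexed
);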


Oh wait - you're thinking of the IE limit. IE7 removed that, I think; I'm pretty sure it's now at least 1024.
perkiset
« Reply #9 on: April 08, 2008, 05:16:27 PM »

Gotcha... thanks mang.

And about the varchar length as well... I'm gonna have to go look at that - that puts it into TEXT-field size territory and I'm surprised the index will handle that... thanks for the tip!
nutballs
« Reply #10 on: April 08, 2008, 05:29:28 PM »

Actually, don't quote me on the indexing - something in the back of my head is telling me that the index limit is 1024.
DangerMouse
« Reply #11 on: July 14, 2008, 03:24:58 AM »

Sorry to drag this old thread up, but I've finally got around to implementing the system under discussion!

Quick question on the structure Perk outlined above: would it be more efficient to have an additional table as an intermediary between keywords and results, detailing the date of the scrape? i.e.

tbl_Keywords
id
keyword

tbl_ResultSets
id
date
fkKeyword

tbl_Result
id
Rank
fkResultSet

Just wondering if this would make it easier or more difficult to query.

Cheers,

DM
Logged
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #12 on: July 14, 2008, 09:21:45 AM »

Why an intermediate table? If the data is a 1->1 relationship, throw it in the same table; if it's 1->[n] then it must be another table... I guess I'm not sure what you're trying to do with that.
DangerMouse
« Reply #13 on: July 14, 2008, 11:37:00 AM »

I'm thinking about the most effective way to get the data back out. If I've got multiple result sets relating to different dates for each keyword in the results table - i.e. the top 100 results for a keyword in week 1, the top 100 for the same keyword in week 2 - what's the best way to pull the data back out?

If I had a middle table logging every time a result set was grabbed, wouldn't it be easier to query for the latest data, for example?

Does that make a little more sense?

DM
perkiset
« Reply #14 on: July 14, 2008, 11:44:48 AM »

Well, functionally (from a speed perspective) it will make no difference. But from a clarity of thinking perspective, I think yes. With this sort of arrangement you have a list of collection dates that can then be used in a simple master-detail relationship with the data, which will make retrieval easier IMO.

(And if I read you right)
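E.g. "the newest set for a keyword" then boils down to something like this - a sketch using your table names, untested, with 42 standing in for the keyword's ID:

SELECT r.*
FROM tbl_Result r
JOIN tbl_ResultSets rs ON rs.id = r.fkResultSet
WHERE rs.fkKeyword = 42
  AND rs.`date` = (SELECT MAX(`date`) FROM tbl_ResultSets WHERE fkKeyword = 42)
ORDER BY r.Rank;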