|
DangerMouse
|
 |
« on: April 08, 2008, 09:55:30 AM » |
|
Hey all,
I've been puzzling over this for a while and havent come to any good conclusions yet - how would you go about structuring a DB to track SERPs over time?
Obviously its possible to simply record results against a date, but this strikes me as really inefficient, you could end up storing data that hasnt changed over and over. Equally however I'm not sure how i'd devise some kind of transactional logic that compares the existing record for a unique item (URL) and only updates if appropriate i.e. title tag could change, snippet could change not to mention ranking position.
I try to normalise my databases as much as possible but the more I think about this the more I can't quite figure out the correct tables and table types I'd need. I'm sure this is a common relational DB problem, even if its applied to things other than SERPs. Any tips?
DM
|
|
|
|
|
Logged
|
|
|
|
|
nutballs
|
 |
« Reply #1 on: April 08, 2008, 10:18:27 AM » |
|
your only talking about URLs and Ranks, so the only normalizing you can do really is a table of URLs and URLids, and a rank table of URLids and Ranks. the problem is that as the database gets bigger, the checks to see if the URL already exists will be kinda huge, assuming many keywords being tracked and many URLs. But maybe not, I am often surprised at how little the data sizes of things change once you seed it fully. my gut says yes though.
Another option is a table of URLID keywordID date rank URL keywords are in another table. only add a new entry if the rank changes. so if google.com is ranked #1 for "search engines", the next time you run the analysis and google is still #1, dont store it. that way you can do a grab of the newest rank 1,2,3,4,5,6 for a particular keyword without much trouble. however, this also would result in some heafty bloat.
|
|
|
|
|
Logged
|
|
|
|
|
DangerMouse
|
 |
« Reply #2 on: April 08, 2008, 11:09:51 AM » |
|
so if google.com is ranked #1 for "search engines", the next time you run the analysis and google is still #1, dont store it. that way you can do a grab of the newest rank 1,2,3,4,5,6 for a particular keyword without much trouble. however, this also would result in some heafty bloat.
Definately, this is the issue I've been puzzling over - and its actually just got a little more complicated because I hadnt even got as far as considering multiple keywords yet!  So far in my head I'd set the tables up like this; but I'm just not sure how to store the relationship with date correctly. tbl_UrlsIDURL fk_Updates tbl_MetaIDfk_Url Title Snippet fk_Updates tbl_UpdatesIDUpdate_DateTime I'm thinking if its structured like this I can join tbl_Url to tbl_Updates to get a specific timeframe, then tag on the meta data hopefully with a condition that gets the maximum of tbl_Meta.fk_Updates while being less than or equal to the currently selected 'update' time. Just read that back and I think i've even managed to confuse myself so not sure if anyone else has a hope of deciphering my ramblings  . DM
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #3 on: April 08, 2008, 12:31:57 PM » |
|
I'd do something more akin to:
tbl_Keywords =================== id int pk autoinc phrase vc255 unique
tbl_URLs =================== id int pk autoinc url vc255 unique
tbl_Engines =================== id int pk autoinc name vc64 indexed
tbl_SERPs =================== keyword_id int indexed url_id int indexed engine_id indexed (those three comprise my PK) rank int checkedon date
The serps table will be very small, comprising only 4 integers and a date field. You can search quickly by keyword, url, engine etc. Run a small cron job to clear out records in this table older than, say, 180 days and you'll never have troubles.
Don't underestimate the value of seeing something NOT change over and over and over... that is equally good information and the complexity you introduce by trying to short the db here will cost you later. HardDrive is cheap - your time is expensive.
Obviously, a base like this can be used to expand the DB in the direction it looks like you are going...
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|
DangerMouse
|
 |
« Reply #4 on: April 08, 2008, 12:47:56 PM » |
|
the complexity you introduce by trying to short the db here will cost you later. HardDrive is cheap - your time is expensive.
^^ a great tip, I think I get caught up in the whole normalisation thing for no real reason. Just to confirm, using an implentation like you described above Perk, will performance take a huge hit as the DB grows? I'm just thinking along the lines of 100 keywords, 100 results, 2 engines, once per week for 6 months = approx 100 x 100 x 2 x 24 = 480k rows - and it wouldnt be difficult to see this grow even further. DM
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #5 on: April 08, 2008, 12:56:09 PM » |
|
Not at all - that's a pretty tiny set, all in all. My current SERP tables are in the 8-10MM zone regularly, as are my surfer tracking and telemetry records. Correct indexing is ABSOLUTELY necessary, but the sizes you're talking about are really fine.
Bear in mind, that there's only one table growing really - just the SERP table which is tiny, light and fixed-length. It'll be way fast.
When you are coding it, I'd preload all the URLS, engines and phrases into arrays so that I can create update records without ever having to hit the DB for ID values.
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|
nutballs
|
 |
« Reply #6 on: April 08, 2008, 03:18:44 PM » |
|
480k pfft. no problem. though on a shared host you might get some complaints if your queries are not tuned.
perks schema looks solid, thought the varchar lengths might needs some dicking with, like the URL field for example.
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #7 on: April 08, 2008, 03:41:34 PM » |
|
...thought the varchar lengths might needs some dicking with, like the URL field for example.
isn't a URL locked into 255 chars? Thought it was... anyhoo, if you choose larger, you'll need to go to a TEXT type field which changes your indexing options a bit
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|
nutballs
|
 |
« Reply #8 on: April 08, 2008, 05:13:23 PM » |
|
nope a domain is max 100 plus extension, but a URL with params can be miles long.
varchar will still work if you are using 5.0.3 or higher. varchar(65535) though that is also the row size limit, so i dont recommend that. however a few thousand can still be used if the rest of the row is just small datachunks like INT. for URL lengths i stop at 1k since its just silly after that. (though 1k is silly as is...) This allows you to still have the indexing advantage of varchar.
OH wait. you are thinking of the IE limit. ie 7 removed that i think. i am pretty sure it is now at least 1024.
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #9 on: April 08, 2008, 05:16:27 PM » |
|
Gotcha... thanks mang.
And about the varchar len as well... I'm gonna have to go look at that - that puts it into a text field size and I'm surprised that the index handle will manage that... thanks for the tip!
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|
nutballs
|
 |
« Reply #10 on: April 08, 2008, 05:29:28 PM » |
|
actually dont quote me on the indexing. something in the back of my head is telling me that the index limit is 1024.
|
|
|
|
|
Logged
|
|
|
|
|
DangerMouse
|
 |
« Reply #11 on: July 14, 2008, 03:24:58 AM » |
|
Sorry to drag this old thread up, but I've finally got around to implementing the system under discusion!
Quick question on the struction Perk outlined above, would it be more efficient to to have an additional table that was an intermediary between Keyword and Results, detailing the date of the scrape. i.e.
tbl_Keywords id keyword
tbl_ResultSets id date fkKeyword
tbl_Result id Rank fkResultSet
Just wondering if this would make it easier or more difficult to query.
Cheers,
DM
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #12 on: July 14, 2008, 09:21:45 AM » |
|
Why an intermediate table? If the data is a 1->1 relationship, throw it in the same table. If it's 1->[n] then it must be another table... I guess I'm not sure what you're trying to do with that
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|
DangerMouse
|
 |
« Reply #13 on: July 14, 2008, 11:37:00 AM » |
|
I'm thinking about the most effective way to get the data back out. So if I've multiple result sets relating to different dates for each keyword within the results table, i.e. top 100 results for a keyword week 1, top 100 results for same keyword week 2; I'm curious about the best way to pull the data back out.
If I had a middle table logging everytime a result set was grabbed wouldnt it be easier to query for the latest data for example?
Make a little more sense?
DM
|
|
|
|
|
Logged
|
|
|
|
perkiset
Olde World Hacker
Administrator
Lifer
   
Online
Posts: 5324
:sniffle: Humor was so much easier before.
|
 |
« Reply #14 on: July 14, 2008, 11:44:48 AM » |
|
Well, functionally (from a speed perspective) it will make no difference. But from a clarity of thinking perspective, I think yes. With this sort of arrangement you have a list of collection dates that can then be used in a simple master-detail relationship with the data, which will make retrieval easier IMO.
(And if I read you right)
|
|
|
|
|
Logged
|
If I can't be Mr. Root then I don't want to play.
|
|
|
|