
DangerMouse
Hey all,
I've been puzzling over this for a while and haven't come to any good conclusions yet - how would you go about structuring a DB to track SERPs over time? Obviously it's possible to simply record results against a date, but this strikes me as really inefficient: you could end up storing data that hasn't changed over and over. Equally, I'm not sure how I'd devise some kind of transactional logic that compares the existing record for a unique item (URL) and only updates it when appropriate - the title tag could change, the snippet could change, not to mention the ranking position.

I try to normalise my databases as much as possible, but the more I think about this the more I can't quite figure out the correct tables and table types I'd need. I'm sure this is a common relational DB problem, even if it's usually applied to things other than SERPs. Any tips?

DM

nutballs
You're only talking about URLs and ranks, so the only normalizing you can really do is a table of URLs and URLids, and a rank table of URLids and ranks.
The problem is that as the database gets bigger, the checks to see whether a URL already exists will be kinda huge, assuming many keywords being tracked and many URLs. But maybe not - I am often surprised at how little the data sizes of things change once you seed it fully. My gut says yes, though.

Another option is a table of (URLid, keywordID, date, rank), with URLs and keywords in their own tables, and you only add a new entry if the rank changes. So if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest rank 1, 2, 3, 4, 5, 6... for a particular keyword without much trouble. However, this also would result in some hefty bloat.
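If you went the change-only route, the insert could be guarded something like this (rough sketch only - the table name is made up and the literal values 42 / 7 / 1 are just placeholders for a URLid, keywordID and rank):

-- hypothetical change-only table: tbl_RankChanges(url_id, keyword_id, rank, checked_on)
-- insert the new observation only if the most recently stored rank differs
INSERT INTO tbl_RankChanges (url_id, keyword_id, `rank`, checked_on)
SELECT 42, 7, 1, CURDATE()
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1
    FROM tbl_RankChanges rc
    WHERE rc.url_id = 42
      AND rc.keyword_id = 7
      AND rc.`rank` = 1
      AND rc.checked_on = (
          SELECT MAX(checked_on)
          FROM tbl_RankChanges
          WHERE url_id = 42 AND keyword_id = 7
      )
);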
DangerMouse
Quote from: nutballs
"so if google.com is ranked #1 for "search engines" and the next time you run the analysis Google is still #1, don't store it. That way you can grab the newest rank 1, 2, 3, 4, 5, 6... for a particular keyword without much trouble. However, this also would result in some hefty bloat."

Definitely - this is the issue I've been puzzling over, and it's actually just got a little more complicated because I hadn't even got as far as considering multiple keywords yet!

So far in my head I'd set the tables up like this, but I'm just not sure how to store the relationship with date correctly:

tbl_Urls
  ID
  URL
  fk_Updates

tbl_Meta
  ID
  fk_Url
  Title
  Snippet
  fk_Updates

tbl_Updates
  ID
  Update_DateTime

I'm thinking if it's structured like this I can join tbl_Urls to tbl_Updates to get a specific timeframe, then tag on the meta data, hopefully with a condition that gets the maximum of tbl_Meta.fk_Updates while being less than or equal to the currently selected 'update' time. Just read that back and I think I've even managed to confuse myself, so I'm not sure anyone else has a hope of deciphering my ramblings.
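To put that into a query, I'm imagining something roughly like this (very much a sketch, assuming the tables above and a chosen update time in @selected_update):

SELECT u.URL, m.Title, m.Snippet
FROM tbl_Urls u
JOIN tbl_Meta m     ON m.fk_Url = u.ID
JOIN tbl_Updates up ON up.ID = m.fk_Updates
WHERE up.Update_DateTime = (
    -- newest meta row for this URL recorded at or before the selected update time
    SELECT MAX(up2.Update_DateTime)
    FROM tbl_Meta m2
    JOIN tbl_Updates up2 ON up2.ID = m2.fk_Updates
    WHERE m2.fk_Url = u.ID
      AND up2.Update_DateTime <= @selected_update
);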
DM

perkiset
I'd do something more akin to:
tbl_Keywords
===================
id      int pk autoinc
phrase  vc255 unique

tbl_URLs
===================
id      int pk autoinc
url     vc255 unique

tbl_Engines
===================
id      int pk autoinc
name    vc64 indexed

tbl_SERPs
===================
keyword_id  int indexed
url_id      int indexed
engine_id   int indexed
(those three comprise my PK)
rank        int
checkedon   date

The SERPs table will be very small, comprising only 4 integers and a date field. You can search quickly by keyword, URL, engine etc. Run a small cron job to clear out records in this table older than, say, 180 days and you'll never have trouble.

Don't underestimate the value of seeing something NOT change over and over and over... that is equally good information, and the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive is cheap - your time is expensive.

Obviously, a base like this can be used to expand the DB in the direction it looks like you are going...
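In straight MySQL that comes out roughly like this (a sketch - engine, exact types and sizes are up to you, and note I've added checkedon to the SERPs primary key so the same keyword/URL/engine combination can appear on more than one date):

CREATE TABLE tbl_Keywords (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    phrase VARCHAR(255) NOT NULL,
    UNIQUE KEY (phrase)
);

CREATE TABLE tbl_URLs (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL,
    UNIQUE KEY (url)
);

CREATE TABLE tbl_Engines (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL,
    KEY (name)
);

CREATE TABLE tbl_SERPs (
    keyword_id INT UNSIGNED NOT NULL,
    url_id     INT UNSIGNED NOT NULL,
    engine_id  INT UNSIGNED NOT NULL,
    `rank`     INT NOT NULL,
    checkedon  DATE NOT NULL,
    PRIMARY KEY (keyword_id, url_id, engine_id, checkedon),
    KEY (url_id),
    KEY (engine_id)
);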
DangerMouse
Quote from: perkiset
"the complexity you introduce by trying to shortcut the DB here will cost you later. Hard drive is cheap - your time is expensive."

^^ A great tip - I think I get caught up in the whole normalisation thing for no real reason.

Just to confirm: using an implementation like you described above, Perk, will performance take a huge hit as the DB grows? I'm thinking along the lines of 100 keywords, 100 results, 2 engines, once per week for 6 months = approx 100 x 100 x 2 x 24 = 480k rows - and it wouldn't be difficult to see this grow even further.

DM

perkiset
Not at all - that's a pretty tiny set, all in all. My current SERP tables are in the 8-10MM zone regularly, as are my surfer tracking and telemetry records. Correct indexing is ABSOLUTELY necessary, but the sizes you're talking about are really fine.
Bear in mind that there's really only one table growing - the SERPs table, which is tiny, light and fixed-length. It'll be way fast. When you are coding it, I'd preload all the URLs, engines and phrases into arrays so that I can create update records without ever having to hit the DB for ID values.
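In other words, swap thousands of per-row lookups for a few one-off selects at the start of the run (sketch - example.com is just a placeholder):

-- the per-row lookup you want to avoid repeating for every SERP row collected:
SELECT id FROM tbl_URLs WHERE url = 'http://www.example.com/';

-- instead, pull each lookup table once per run and keep value => id maps in memory:
SELECT id, url    FROM tbl_URLs;
SELECT id, phrase FROM tbl_Keywords;
SELECT id, name   FROM tbl_Engines;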
nutballs
480k? Pfft, no problem. Though on a shared host you might get some complaints if your queries are not tuned.
Perk's schema looks solid, though the varchar lengths might need some dicking with - the URL field, for example.

perkiset
Quote from: nutballs
"...though the varchar lengths might need some dicking with - the URL field, for example."

Isn't a URL locked into 255 chars? Thought it was... Anyhoo, if you choose larger, you'll need to go to a TEXT type field, which changes your indexing options a bit.

nutballs
Nope - a domain is max 100 plus extension, but a URL with params can be miles long.
Varchar will still work if you are using MySQL 5.0.3 or higher - up to varchar(65535), though that is also the row size limit, so I don't recommend that. However, a few thousand can still be used if the rest of the row is just small data chunks like INTs. For URL lengths I stop at 1k, since it's just silly after that (though 1k is silly as is...). This allows you to still have the indexing advantage of varchar.

Oh wait - you are thinking of the IE limit. IE7 removed that, I think; I'm pretty sure it is now at least 1024.
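If you do go long on the URL column, a prefix index keeps it searchable - something like this (sketch; the exact key-length ceiling depends on the storage engine, and a prefix key here is plain, not unique):

CREATE TABLE tbl_URLs (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(1024) NOT NULL,
    -- prefix index: only the first 255 chars go into the key,
    -- which is plenty for lookups and stays under the key-length limit
    KEY idx_url (url(255))
);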
perkiset
Gotcha... thanks mang.
And about the varchar length as well... I'm gonna have to go look at that - that puts it into TEXT-field-size territory, and I'm surprised the index will handle that... thanks for the tip!

nutballs
Actually, don't quote me on the indexing - something in the back of my head is telling me that the index key limit is 1024.