The Cache: Technology Expert's Forum
 
Author Topic: Compressing || minimizing data for db  (Read 2772 times)
kurdt
Lifer
Posts: 1153
paha arkkitehti

« on: October 01, 2009, 10:50:26 AM »

Let's say I have 1 million visitors per day. I collect IP, referer, timestamp and a few other numbers like user ID and so on. I just did a test and generated a month's worth of records: the database ended up taking 2.4 gigs. But now I want to save usage stats for every page, and every visitor visits 5 pages on average. That's 5 million records per day, which would be about 12 gigs per month. As you can see, I'm only saving three longer data pieces: IP, referer and timestamp. What if I want to save a lot more data per row? How the hell am I going to cut down the size requirements? You can easily calculate that saving ten times more data per row will make the dataset way, way bigger.

So, any ideas on how to compress or minimize the data? I'm not doing any real-time processing with it, so a little penalty from decompression isn't an issue. Is there any other solution than compressing individual strings? Maybe do daily SQL dumps that are compressed, and decompress and import them back into the database when needed for processing? I'm not sure about this, but the last time I tried to import a BIG dataset into MySQL it took a looooonnnggg time; please correct me if I'm wrong.
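For reference, "compressing individual strings" can be done right inside MySQL with the built-in zlib functions COMPRESS()/UNCOMPRESS(); the hits table and column names below are just placeholders for illustration, not anyone's actual schema.

CREATE TABLE hits (
  ip      INT UNSIGNED NOT NULL,   -- INET_ATON() packs a dotted-quad IP into 4 bytes
  referer BLOB,                    -- holds the COMPRESS()'d referer string
  ts      TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=MyISAM;

INSERT INTO hits (ip, referer)
VALUES (INET_ATON('10.0.0.1'),
        COMPRESS('http://example.com/some/long/referring/url?with=params'));

SELECT INET_NTOA(ip) AS ip, UNCOMPRESS(referer) AS referer, ts FROM hits;

Note that zlib adds a few bytes of header overhead, so COMPRESS() only pays off on longer values like full referer URLs; very short strings can actually come out bigger.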

I met god and he had nothing to say to me.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096

« Reply #1 on: October 01, 2009, 02:19:43 PM »

First, distill your data. You do not need to keep a row for every visitor forever. Take it from someone who's tried... you don't. After (an amount of time), take the stats from the day and distill them into a single row that gives you overall traffic, average time per page, average total stay, hits, surfers, bots, etc. Then you'll wind up with just 365 rows per year per site ;)
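A rough sketch of what that nightly distill could look like in SQL; the raw hits table, the daily_stats table and the columns here are assumptions, not perk's actual schema:

INSERT INTO daily_stats (stat_date, hits, uniques, pages_per_visitor)
SELECT DATE(ts)                      AS stat_date,
       COUNT(*)                      AS hits,
       COUNT(DISTINCT ip)            AS uniques,
       COUNT(*) / COUNT(DISTINCT ip) AS pages_per_visitor
FROM   hits
WHERE  ts >= CURDATE() - INTERVAL 1 DAY
  AND  ts <  CURDATE()
GROUP  BY DATE(ts);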

It would be silly to track every single IP that comes in... so make yourself a table of the IPs you want to watch for. I download the spiderSpy DB every day, convert it to MySQL, and then see who's coming to dinner on the first call from every surfer. I keep bots, spiders, asshats, etc. in a table with a unique integer ID for each one; then when I see one, I simply create an xref record of date and ID. Much lighter.
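Something along these lines, presumably; the table and column names are guesses, but the idea is that a sighting of a known agent costs only a date plus a small integer:

CREATE TABLE watched_agents (
  agent_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  agent_name VARCHAR(255) NOT NULL    -- bot, spider, asshat, etc.
) ENGINE=MyISAM;

CREATE TABLE agent_sightings (
  seen_date DATE NOT NULL,
  agent_id  INT UNSIGNED NOT NULL,
  KEY (seen_date, agent_id)
) ENGINE=MyISAM;

-- when a known agent shows up, the xref record is just a date and an id
INSERT INTO agent_sightings (seen_date, agent_id) VALUES (CURDATE(), 42);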

I track surfer actions click by click and hang onto them for anywhere from 2 days to 2 months, depending on the application. A nightly cron job distills the data based on the cutoff time for (that application), then I run a table optimize (to shrink it back down) and a backup. Clean, simple.
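The prune-and-shrink step might look like this, assuming a 60-day cutoff and a hypothetical clicks table; OPTIMIZE TABLE is what reclaims the space the DELETE leaves behind:

DELETE FROM clicks WHERE ts < NOW() - INTERVAL 60 DAY;
OPTIMIZE TABLE clicks;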

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs
Administrator
Lifer
Posts: 5627
Back in my day we had 9 planets

« Reply #2 on: October 01, 2009, 03:38:37 PM »

Agree with perk, but I also know that requirements from some PHB might prevent that.

SO...

Basically you are already bare-bones, but you could start taking advantage of some relational benefits:

AgentStringTable
--AgentID
--AgentString

MainTable
--IP (store as an int if you can; it's only 4 bytes)
--fk_AgentID
--timestamp

Any repetitive piece of info, like the agent string, should go into a relational table.
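In MySQL DDL that might come out something like this (the names and types are just one way to do it); INET_ATON() is what lets you store the IP as a 4-byte unsigned int:

CREATE TABLE AgentStringTable (
  AgentID     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  AgentString VARCHAR(512) NOT NULL,
  UNIQUE KEY (AgentString(255))
) ENGINE=MyISAM;

CREATE TABLE MainTable (
  IP         INT UNSIGNED NOT NULL,   -- INET_ATON('1.2.3.4') = 16909060
  fk_AgentID INT UNSIGNED NOT NULL,
  ts         TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY (ts),
  KEY (fk_AgentID)
) ENGINE=MyISAM;

INSERT INTO MainTable (IP, fk_AgentID) VALUES (INET_ATON('66.249.66.1'), 1);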

But... if you have unique data for EVERY user, then you REALLY need to decide how important it is.

I could eat a bowl of Alphabet Soup and shit a better argument than that.
kurdt
Lifer
Posts: 1153
paha arkkitehti

« Reply #3 on: October 01, 2009, 11:42:06 PM »

Hmm, interesting thoughts. At first I had a little resistance to deleting data, but if I can process it every week, for example, what the fuck would I do with the original data anyway? However, I think I still want to keep tightly compressed dumps, because what if I figure out new stuff I'd love to get from the old data? And as we all know: the more data, the more accurate the results.

By the way, if you want a pretty fucking great file compressor, check out my fellow countryman Sami's NanoZip. Now, I don't know the guy; I just checked some recent compression benchmarks and NanoZip was leading. But as with all smaller programs, you might want to keep the program itself backed up somewhere, because if you lose the program and the project is dead, you have a lot of files you can't decompress :)

I met god and he had nothing to say to me.