 |
cdc
Anybody know any good (or great) tutor ials on database optimization? I remember the basics from my college classes, but I'd really like to take it a step further. I'm looking for answers to my questions like: 1. When should I create a second index on my tables aside from the auto incrementing id field? 2. What should I use to store IP addresses that are looked up often, never changed, and only added to infrequently? I'm guessing a VARCHAR isn't best... etc.
perkiset
Unfortunately dbDog is on vacation in Hawaii or he'd be all over this. From what I remember about academic "optimization studies" they were all crap. Had no solid connection to the real world. Simple answers to your first questions: quote author=cdc link=topic=65.msg230#msg230 date=1177187556 1. When should I create a second index on my tables aside from the auto incrementing id field?
When you have more than 1000 records in the table and intend to do lookups in a particular way. You should never have indicies for things you won't be looking up frequently. Additionally, indicies may not be behaving the way you'd expect or even used. The Cardinality of the index is really important. In MySQL for example, if there is not a cardinality of at least 50% ie., no more than 1/2 the records are duplication, then the index will not even be used in a search. Consider: if you had a field called "fulfilled" and it is either on or off (you're selling something and you want to quickly get a list of all fulfilled or not) with 10000 records. The index will not com into play because there is not enough difference in the records to do so. If you simply make the index fulfilled + id (the unique number) it will be quick as you'd expect. When you intend to do a bunch of queries on a table, it helps to have MySQL explain the SQL it will be using to execute the query. This will help in your optimization. quote author=cdc link=topic=65.msg230#msg230 date=1177187556 2. What should I use to store IP addresses that are looked up often, never changed, and only added to infrequently? I'm guessing a VARCHAR isn't best...
It would seem - and years ago that was true. I remember dbDog drilling me 10 years ago if I chose a varchar over a char in Oracle because of speed differences... but in MySQL today (standard tables, not innodb and such) I think anything over 12 chars IS a varchar automatically. I store my IPs in a 16 char varchar and it works great. /p
cdc
Ok, thanks perk. quote When you intend to do a bunch of queries on a table, it helps to have MySQL explain the SQL it will be using to execute the query. How do you get MySQL to tell you the SQL it will be using? quote I store my IPs in a 16 char varchar and it works great. Well, I'm happy to hear that because I'm doing the same thing. My server has been running at full speed recently and I thought my recent introduction of cloaking might be the reason, but I'm doing much more than that. It's good to know that I can cross that off the list of things to worry about optimizing.
perkiset
quote author=cdc link=topic=65.msg239#msg239 date=1177197297 How do you get MySQL to tell you the SQL it will be using?
EXPLAIN select * from mytable where this=that order by another The resulting table will contain the hierarichy of how it will attack the puzzle of your query. Put some tables together, put some queries up against them and post what you get - we can help you understand what it's saying to you. /p
thedarkness
Yeah, post what you get, EXPLAIN can be a bit daunting when you first attack it so it'll prolly cut the learn ing
curve for a few souls. Cheers, td
nutballs
the best way to store IPs is to convert them to an integer. Actually BigInt in MSsql and i am pretty sure in MySQL as well. I posted an ASP version of an IP2num converter here: http://www.perkiset.org/forum/ asp _net /classic asp _ip_to_num_and_num_to_ip_for_database_storage-t72.0.html
perkiset
I like that and had not ever thought about it that way. When you need to list out all IPs that are in the DB, doesn't that add a huge amount of overhead to the process though? When you assert it is the best way, may I ask why? I've stored them as varchars, 4 sets of 3 chars, but never a big int and I am struggling to see why that is superior?
nutballs
Yea listing out the IPs ads overhead, obviously because it has to translate from Int to IP. however, usually you would do something like that somewhere other that the SQL server. forexample, have the webserver do it, or the client side app. the reason its better is that INT lookups are always faster. (unless someone changed the rules and i wasnt paying attention). they also use less memory (which also makes them faster as well). Thats the reason why you try to always have a numeric ID column to join tables on. have you ever tried joining on text? I'm sure you dont, unless you HAVE to, like joining sales to a product title. and you wait...
perkiset
Gotcha that makes sense. And yup, of course I join with the cleverly named "id" field which is the PKey on all tables. But an indexed lookup should encur the same overhead, unless I am missing something as well... no? Also, a BigInt in MySQL is 20 bytes and you can store an entire address in 16 - I did not look to see if your routine scrunched the number down below the stock 11 byte MySQL integer. But if you're now talking hash tables or cached lookups in RAM, I totally agree.
nutballs
your right bigint is bigger duh. i was flipflopping between int and big. but when you add in the storage of the index table, the 2 lookups to find 1 value, i still think the INT method works faster. its moot though, unless your talking millions of lookups per minute probably. I just have always done it that way because at the time i started doing it, every cycle counted on the SQL box, even at 10 hits a minute probably.
perkiset
Where this routine intrigues me the most is in my cached spider lookups. I move addresses into the cache to watch for spiders, denieds and such and I like this a lot. Since I'd pay for the conversion of all records only 4 times a day (when I re-up my spider listings) and only on a single address when looking at < the current surfer > but the "walk the hash routine" is looking at ints rather than strings I think I'd get a big bang for buck out of it. Now, looking at your code - it looks as though the number you create is essentially a base-256 version of the number... is that correct? (clever, that - nicely done)
nutballs
yes it is base 256. never actually realized that lol. To Convert to INT just break the octets apart then SUM(256^n-1) A.B.C.D (A*256^3)+(B*256^2)+(C*256)+D to convert back ((number/256^3) % 256).((number/256^2) % 256).((number/256) % 256).(number % 256). where % is the modulus function of your language. so your just getting the remainder of each octet. Mod is limited to INT in asp though so my code chunk does some funky math to compensate.
perkiset
I can't get shit done today so I thought I'd adapt this routine for php - it can be seen here: http://www.perkiset.org/forum/ php /converting_ip_adrresses_to_integers_and_back-t87.0.html NBs - it's pretty tight and would deserve a look so that you can see ASP -> PHP equivalent code. All that being said, I'd love to see a MySQL or Oracle guy weigh in and show the code to do this at the database level... /p
thedarkness
I wonder how it would go written as a UDF? http://en.wikipedia.org/wiki/User_Defined_Function or, as a procedure http://dev.mysql.com/doc/refman/5.0/en/writing-a-procedure.html What's the thoughts on overhead/speed advantages/disadvantages ? It would constitute an interesting exercise and reminds me a lot of a checksumming algorithm I wrote many moons ago. If consensus is that it would indeed be faster/less overhead I would be happy to have a crack at it. Cheers, td
nutballs
I personally try to keep as much data manipulation off the SQL just because i like my SQL to do the one thing its good at, lookups. But thats really just personal preference. obviously the less data you push across a net work, the better. so it really depends on the situation. in this case, pulling the manipulation off the SQL box is probably better, if... they are on seperate boxes.
thedarkness
Yeah, like so much of this shit it's "a little from category A and a little from category B"  Cheers, td
perkiset
quote author=nutballs link=topic=65.msg392#msg392 date=1177388657 I personally try to keep as much data manipulation off the SQL just because i like my SQL to do the one thing its good at, lookups. But thats really just personal preference. obviously the less data you push across a
net work, the better. so it really depends on the situation. in this case, pulling the manipulation off the SQL box is probably better, if... they are on seperate boxes.
Hooboy. dbDog is gonna take it up with you. He's a big "Put it all behind the database as stored procedures" kind of guy. And he's huge, and he's mean and he's continually cracked out and ready for a fight. :checks to see if dbDog is back from Hawaii yet: :  /p
nutballs
i will probably agree, as will he with me in some respects. let me clarify, using two extremes. it would be dumb to do a Select * from table, then find the record you want by looping through the records until you find the one you want on the client/application side. Obviously you want to sort and filter and output, ONLY the data your app/client needs. however. for something like the IP lookup. transforming your filter value BEFORE you send the request to the SQL server makes sense. Not quite an extreme example, but good enough. another reason to keep the "code" out of the database is to keep it system agnostic. That may just be a side effect of alot of the work I do which is relocation type projects.
dirk
I have often used the MySQL functions for the conversion: SELECT INET _ATON('255.255.255.255'); -> 4294967295 SELECT INET _NTOA(4294967295); -> 255.255.255.255 If an INT UNSIGNED column is used it's not necessary to use BIGINT. So only 4 bytes are required instead of 16 (VARCHAR).
nutballs
ah yes, mysql has signed and unsigner. i forgot about that. so unsigned int works. in MSsql, bigint works. which is still only 8bytes compared to the 16 for varchar.(32 if your doing nvarchar i think) @perk- where are you getting 20 bytes for bigint? the following is MSsql, i doubt though that mysql is any different, except for the signed/unsigned capability. bigint - Integer data from -2^63 (-9,223,372,036,854,775,80  through 2^63-1 (9,223,372,036,854,775,807). Storage size is 8 bytes. int - Integer data from -2^31 (-2,147,483,64  through 2^31 - 1 (2,147,483,647). Storage size is 4 bytes. smallint - Integer data from -2^15 (-32,76  through 2^15 - 1 (32,767). Storage size is 2 bytes. tinyint - Integer data from 0 through 255. Storage size is 1 byte.
perkiset
quote author=dirk link=topic=65.msg1028#msg1028 date=1178576095 SELECT
INET _ATON('255.255.255.255'); SELECT INET _NTOA(4294967295);  sweetness. quote author=nutballs link=topic=65.msg1029#msg1029 date=1178576781 @perk- where are you getting 20 bytes for bigint? the following is MSsql, i doubt though that mysql is any different, except for the signed/unsigned capability.
lol I just created one and looked at the size of it. Dorkometer has pegged the needle. Here is the actual storage sizes from the MySQL site:
m0nkeymafia
I always thought mysql was much faster at most calcs than php , plus if you have caching on and the query isnt too big itll save it for you. Also with regards to indexes i think you have to do them in a particular order i.e. if you index col 1, col2 then you need to use them in that order in your query for the index to take full effect At least thats what I read and it seemed to work.
nutballs
quote author=m0nkeymafia link=topic=65.msg1158#msg1158 date=1178837245 Also with regards to indexes i think you have to do them in a particular order i.e. if you index col 1, col2 then you need to use them in that order in your query for the index to take full effect
i hope not... lol
perkiset
quote author=m0nkeymafia link=topic=65.msg1158#msg1158 date=1178837245 I always thought mysql was much faster at most calcs than
php , plus if you have caching on and the query isnt too big itll save it for you.
That may be true (I don't know, honestly) for raw processing because MySQL will be completely compiled and PHP is scripted... but if you add the overhead of getting from a web call in <x> language and getting PHP to do a calc for you as opposed to doing it the language itself that might be different. Also, if you employ script cacheing ala APC and such, I think the calc-speed benchmarks start to get a lot closer quote author=m0nkeymafia link=topic=65.msg1158#msg1158 date=1178837245 Also with regards to indexes i think you have to do them in a particular order i.e. if you index col 1, col2 then you need to use them in that order in your query for the index to take full effect
Perhaps that is true with really complicated WHERE clauses, but for the most part, I'd wager that MySQL is smart enough to see that it can use <the most appropriate index> on the query regardless of usage order...
m0nkeymafia
Ah yeah I hadnt figured that, the queries im talking about are biiiig lol Actually thats a good point, I class these as big cos theyre about 40 lines long and join like 10 tables and so various calcs and groupings on the data etc. Is that what you guys would call "big" too, or is that childs play? lol
perkiset
10 joins? That's a pretty healthy query. I'd definitely EXPLAIN that SQL to make sure that MySQL has a good attack plan. /p
nutballs
lol a 10 way join. thats pretty hefty. largest i ever have done was i think a 16 way. if you are having execution timing issues, you might consider breaking it up. I have found some weird issues with joins, where even though it should not be the case, multiple lookups is signifcantly faster. this also could be bad indexes on my part, but it has happened enough times to me to no longer be a fluke.
m0nkeymafia
Well I think I optimised it quite well Its for a league table So needs to look up the league itself, the league players [its a simple joining table to show whose in what league] then the players themselves then the game table [thers 50k+ in there] then the team table then some other stuff then it needs to do all calculations like figure out whose top, goal difference, who won etc etc took me ages to write lol
thedarkness
Would a view help in this situation? Or maybe it's time to redesign/rationalise that particular database schema? Cheers, td
|