![]() |
Indica
i've now ran into my first major database roadblock
how to optimize mysql databases for very huge databases. i've got a table with 18.8mil rows of user ids. i plan to check these user ids to see if they're online or not, and will be doing this from several servers and thus need a phpscript to send each server a few names to check. i suppose before i explain the script and the bottleneck i should give the structure:CREATE TABLE `userids` ( `id` int(10) unsigned NOT NULL auto_increment, `userid` varchar(16) NOT NULL, `last_check_date` datetime NOT NULL, `sent_date` datetime NOT NULL, `offline_count` tinyint(4) NOT NULL COMMENT 'number of times sn was offline', `status` varchar(1) NOT NULL default '0' COMMENT '0 = ready, 1 = sent for checking, 2 = checked, 3 = inactive', PRIMARY KEY (`id`), UNIQUE KEY `sn` (`sn`), KEY `last_check_date` (`last_check_date`,`sent_date`,`offline_count`,`status`), FULLTEXT KEY `status` (`status`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=18806132 ; to sum this up, last_check_date is the date the user's online status was last checked, and sent_time was the time that userid was sent to one of my servers. sent_time is used in combination with status (1) for pruning in the event that my server doesn't send back the online status for whatever reason. and this causes the problem. i ran my phpscript to send 5 user ids to my other server, so the first 5 rows in this database had status values 1 and had a sent_date set. now my prune time is set to 5 minutes, so it should reset user ids that haven't gotten a response in 5 minutes back to 0. below is the sql:UPDATE `userids` SET status=0 WHERE status LIKE '%1%' AND sent_date <= '2007-11-03 01:34:05' i've just run this command and the result: Affected rows: 5 (Query took 59.0590 sec) ![]() for a phpscript that is to be ran every 5 minutes, this isn't pretty.i know there's some sql gurus around here, can anyone throw me a few bones about how to optimize this db and/or the query? i think it's the sent_date causing things to slow down a ton, but i'm not 100% sure. any tips are appreciated
thedarkness
Something like;
EXPLAIN SELECT status FROM `usersids` WHERE status LIKE '%1%' AND sent_date <= '2007-11-03 01:34:05' Might be a good place to start, post the results here to get the discussion rolling. HTH, td thedarkness
Holy shit, just noticed status is a varchar(1)...... wtf for?
Change it to a tinyint(1) and use status = 1 and I would be surprised if you didn't see an immediate and considerable improvement in query performance. Always favour queries based on numeric data types if you can. You may also want to look at the INNODB engine as opposed to MyISAM. Test, test, test. Post the results of the explain here anyway and don't assume I am the be all and end all of datbase optimizers ![]() My 2c, td [edit]Sorry about the double post[/edit] perkiset
Also, the status like '%1%' is just a huge no-no - you essentially killed any index that could have helped you.
First, TD is right - either make that a tiny int or a CHAR 1. Then build an index on status plus the id (PK). This will make the cardinality of the index perfect. Then do your update as where status = '1' or = 1 and it will rock. /p nutballs
agreed with the others:
change status to tinyint get rid of the like: WHERE status=1 that will eliminate the fulltext index on status you then also dont need to key all those things since both the offlinecount and status are int. (mysql auto indexes numeric types right? MSSQL does, but I cant remember if mySQL.) I think the dates are also autoindexed since they are just numeric anyway technically. so I don't think you need to actually key/index those columns at all. Unless mySQL doesnt autoindex, then nevermind ![]() perkiset
MySQL does not autoindex anything as far as I know, you must explicitly create them. And actually it's not a fulltext index that is created but rather a
regexof sorts that is applied to every row in the table for evaluation, which is why it's taking so freaking long. By using the status+id index it will be trivially fast.Indica
late night (early morning?) coding will do this to you
![]() i've changed the type, and the indexes and speed seems to have gone up a little. i tested the db with 125,000 rows set to be pruned, the result: Affected rows: 125025 (Query took 61.8150 sec) i've refilled the db with the 125,000 test rows and here's the EXPLAIN results: quote EXPLAIN SELECT STATUS FROM `userids` WHERE STATUS =1 AND sent_date <= '2007-11-03 01:34:05' id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE userids index NULL last_check_date 18 NULL 18806131 Using where; Using index nutballs
quote author=perkiset link=topic=603.msg4046#msg4046 date=1194112629 MySQL does not autoindex anything as far as I know, you must explicitly create them. ah ok, good to know. lol perkiset
... it took 61 seconds to do 125K updates? That still seems a bit high... that's only about 2K/sec... maybe I'm just delierous from doing the bills.
Indica
yeah, that's correct perk
i decided to try again with more rows, 170k this time. i don't expect to have to prune many rows, but i'd like to be prepared for anything.i can select the rows very quickly, but updating kills it. for some reason phpmyadmin only says it takes under a second to complete, but my browser certainly takes ~10 seconds to complete so i thinkphpmyadmin is lieing![]() SELECT id FROM `userids` WHERE STATUS =1 AND sent_date <= '2007-11-03 17:58:44' Showing rows 0 - 29 (170,750 total, Query took 0.0021 sec) and now the update.. UPDATE `userids` SET status=0 WHERE status=1 AND sent_date <= '2007-11-03 17:58:44' Affected rows: 170750 (Query took 59.5929 sec) i do not plan to add any more rows to this database, and i plan to move the ones that were found to be online to another table to hopefully speed things up. i think the next test would be going to INNODB. i can't imagine how slow it would take to data mine 200mil records of user information maybe it's time i look into cubes
perkiset
Too bad the DBDog has not been around... he'd have a suggestion. Sorry man, I'm about fresh out of ideas on that one.
thedarkness
id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE userids index NULL last_check_date 18 NULL 18806131 Using where; Using index Why is last_check_date even mentioned here? Why is possible_keys NULL? Why is ref NULL? This has exceeded my knowledge but something looks fishy here, to me at least. Indica, try creating an index on sent_date Cheers, td [edit] I really need to learnmore about explain. Let's start a thread purely for information on interpretting what explain is trying to tell you? [/edit]thedarkness
Sorry for being a serial double poster in this thread :-D
Indica: Try the select with "EXPLAIN EXTENDED" prepended and post the results here, I'm really interested in this so take advantage while I'm "hot" ![]() Dangers of a short attention span ![]() Cheers, td [edit] After you issue the "EXPLAIN EXTENDED SELECT....." statement you need to issue a "SHOW WARNINGS" command to get the output? From; http://dev.mysql.com/doc/refman/5.1/en/explain.html "When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process. EXPLAIN EXTENDED also displays the filtered column as of MySQL 5.1.12." you also may need to issue a "ANALYZE TABLE" to update cardinality, etc. if you make changes. Interestingly the possible_keys section states; " If this column is NULL, there are no relevant indexes. In this case, you may be able to improve the performance of your query by examining the WHERE clause to check whether it refers to some column or columns that would be suitable for indexing. If so, create an appropriate index and check the query with EXPLAIN again. See Section 12.1.4, “ALTER TABLE Syntax”. To see what indexes a table has, use SHOW INDEX FROM tbl_name. " Which would tend to indicate that your query can not find a suitable index to use and this I would say is critical........ HTH [/edit] thedarkness
haha, i kill me
![]() Also read, http://dev.mysql.com/tech-resources/articles/performance-partitioning.html and http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning.html Looks like partitioning may present a possible (substantial) performance increase if we can work out a good way to lay it out. Post the "EXPLAIN EXTENDED" output first and let's get that as optimized as possible before we go too far down the partition road. Told you I was "hot" ![]() Cheers, td Indica
quote author=thedarkness link=topic=603.msg4052#msg4052 date=1194134199 Why is last_check_date even mentioned here? that's the name of the index, phpmyadmin did thisquote author=thedarkness link=topic=603.msg4052#msg4052 date=1194134199 Why is possible_keys NULL? Why is ref NULL? good questions and i'm not sure. never had to tweak a db like this before, most behave the way they should sent_date has an index, it's under last_check_date with last_check_date, sent_date, offline_count, and status. when i created the table i think phpmyadmin clumped them all together, if they're supposed to be like this or on their own i'm not sure.here's explain extended: EXPLAIN EXTENDED SELECT STATUS FROM `userids` WHERE STATUS =1 AND sent_date <= '2007-11-03 01:34:05' id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE userids index NULL last_check_date 18 NULL 18806131 Using where; Using index SHOW WARNINGS showed nothing. so i then did the analyze Table Op Msg_type Msg_text oldids.userids analyze status OK and show index: Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment userids 0 PRIMARY 1 id A 18806131 NULL NULL BTREE userids 0 sn 1 sn A 18806131 NULL NULL BTREE userids 1 last_check_date 1 last_check_date A 1 NULL NULL BTREE userids 1 last_check_date 2 sent_date A 934 NULL NULL BTREE userids 1 last_check_date 3 offline_count A 934 NULL NULL BTREE userids 1 last_check_date 4 status A 934 NULL NULL BTREE userids 1 id 1 id A 18806131 NULL NULL BTREE userids 1 id 2 status A 18806131 NULL NULL BTREE there's already an index on both status and sent_date? ![]() as for partitioning, yeah i've considered this. i was going to make a table for the a usernames, b usernames, etc. i would prefer not to as i'm going to use a similar system for another project, but if it's the way it has to be i guess that's it. would be nice not to have to reimport this entire thing, damn db took 4 days to import as it is ![]() if none of this improves speed, i'll try switching to innodb. and if that is still shit, ill try partitioning. thanks for the help thus far
thedarkness
Some observations and things to try (these examples are from a famous mysql test database called sakila);
CREATE TABLE `actor` ( `actor_id` smallint(5) unsigned NOT NULL auto_increment, `first_name` varchar(45) NOT NULL, `last_name` varchar(45) NOT NULL, `last_update` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP, PRIMARY KEY (`actor_id`), KEY `idx_actor_last_name` (`last_name`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 mysql> show index from actor; +-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | +-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ | actor | 0 | PRIMARY | 1 | actor_id | A | 200 | NULL | NULL | | BTREE | | | actor | 1 | idx_actor_last_name | 1 | last_name | A | 200 | NULL | NULL | | BTREE | | +-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ 2 rows in set (0.00 sec) mysql> explain select first_name from actor where actor_id < 11 and last_update < '2006-02-15 04:34:34'; +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | actor | range | PRIMARY | PRIMARY | 2 | NULL | 10 | Using where | +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ 1 row in set (0.00 sec) These results look more like what I would expect than yours do. Actually, just looking at this it looks to me like last_check_date is a multiple-column index using sent_date, offline_count and status, there's also one composed of id and status so I'd say that as explain was trying to tell us there was no suitable index for the query. Try this; create index stat on userids (status); // may use, may improve speed create index sent on userids (sent_date); // may use, may improve speed, may use in conjunction with stat? create index sent_stat on userids (status,sent_date); // Order important? create index stat_sent on userids (sent_date,status); // Looks like the winner, test them all though Now you may have to try each one uniquely (not sure here), so create one and delete the others then test with explain to see whether it's using an index and what your speed is like. If you throw all the indexes in there together it may be smart enough to pick the best one (prolly is) but it may just muddy the waters so I'd do each in isolation and then all together to see if it uses more than one etc. Given that we are creating these indexes explicitly on the columns the queries are based on they should help a lot. Oh and use analyze (help analyze table) and explain a lot and maybe consider optimize (help optimize table) if you suspect you have "holes" in your data. Report back dude. Cheers, td thedarkness
Don't leave us hanging Indica, what's the scoop?
Cheers, td Indica
i'm going to work on this database some more this weekend. very busy week, i haven't forgotten.
i got a feeling i will have to partition it, i'd rather not do this because of future projects that will be on a similar scale. they won't partition well due to the nature of the data. and if i can't get a 18mil record db going smooth(er), i don't want to think of having 200mil records with gbs of data
nop_90
I had a similar problem when i made a program that handled newspaper subscribers there was like 100K of them.
Subscribers could stop and start daily The program had to run on a 486. The problem is the updates. That is what killed me. The db schema for each user was like this. route number,subscriber id, name etc and i was all in one table. there where approx 500 routes (i can not remember details). i then made like 500 tables with same scheme as above but with each table only having users of same route. I made a class that would handle all the housekeeping (proggie was in pythonand this was before sqlobject was invented).that way when new user was added etc it would go to approriate route table. Also the class on the surface made it ap pearas if all in 1 table.Indica
well, i am now partitioning the large table up into tables for each letter. will give an update once things are finished importing.
thedarkness
Did you try creating those indexes and testing them as per my large post on this page Indica? All evidence would suggest that that query was running without using an index, if you can get it to use a suitable index I would expect it to make a hell of a difference........
Cheers, td chide
Don't bother with partitioning.. I've had databases with 300mil rows and it didn't make any substantial difference.
One major thing to look at though: how big is your key buffer? If you're playing with tables in the tens of millions you'll want at least 256mb. If it's an InnoDB table (which I'd use for a table that big) make sure to set InnoDB's buffer pool to at least 256mb. The server should have at least a gig of ram in it. More memory makes everything faster.
nutballs
id agree. In MsSQL i did a database that processed 1-2 gigs of weblogs a day. No partitioning.
In my experience, proper organization of the tables and indexes can be just as effective as partitioning, without the coding legwork to deal with it. Though of course in msSQL, there is auto partitioning(i cant remember what the hell its called for some reason), so you can do that if push comes to shove. I assume MySQL doesnt do that though. Indica
quote author=chide link=topic=603.msg4235#msg4235 date=1195166644 Don't bother with partitioning.. I've had databases with 300mil rows and it didn't make any substantial difference. One major thing to look at though: how big is your key buffer? If you're playing with tables in the tens of millions you'll want at least 256mb. If it's an InnoDB table (which I'd use for a table that big) make sure to set InnoDB's buffer pool to at least 256mb. The server should have at least a gig of ram in it. More memory makes everything faster. ![]() the server this db is on hasn't got much ram. i've been testing on my own test server which does have 1gb ram. partitioning seems to have helped. a prune of 125k userids took 10 seconds, before it would take 60. i will try increasing the buffer and switching to innodb, if no increase is seen with this new change, well, 10 seconds is fine. it seems my version of sql doesn't have innodb as a storage engine. i'll have to look into installing it. chide
What version of mysql are you using? Version 5+ comes with InnoDB standard. I wouldn't use older stuff anyway.. they've made a ton of improvements in the last few years.
thedarkness
Get your indexes right or it will come back to bite you.
Cheers, td perkiset
quote author=thedarkness link=topic=603.msg4244#msg4244 date=1195210890 Get your indexes right or it will come back to bite you. ... and with the size of that database, it'll be a great white bite, not a mosquito
Indica
am using mysql 5.0.20, xampp setup.
i'll fix the indexes, before they bite my leg off
nop_90
The size of the db does not matter when it comes to queries, (or matter very little).
What kills u is when u have to update. Or that was the case on mysql 3 When u update the table it has to lock and then unlock. chide
InnoDB does row level locking so it's not an issue anymore if you use that table format.
perkiset
InnoDB, however, cuts into performance pretty hard. The largest reason for InnoDB is if you need transactions. If you are only doing relatively simple DB work, then the default handler is fine. And the table-locking issue only really becomes an issue if you have a high-thrash database where lots of processes are trying to update simultaneously. The lock/release cycle is actually really quick, but it does hold up all other process from writing. Lock/release does not affect reading the table, however, so if you have a high-read/low-write application, the defaults are *very* adequate.
chide
That used to be the case.. InnoDB has gotten a LOT better though:
http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/ Though YMMV, of course. Indica
quote author=perkiset link=topic=603.msg4250#msg4250 date=1195345227 If you are only doing relatively simple DB work, then the default handler is fine. And the table-locking issue only really becomes an issue if you have a high-thrash database where lots of processes are trying to update simultaneously. the majority of the queries are just selecting data and updating one field, nothing too complex. just a lot of select/updates. chide
Either way.. give it LOTS of cache. (but not too much that the OS starts swapping it to disk)
|
Thread Categories
| Best of The Cache Home | ||
| Search The Cache |
- Ajax
- Apache & mod_rewrite
- BlackHat SEO & Web Stuff
- C/++/#, Pascal etc.
- Database Stuff
- General & Non-Technical Discussion
- General programming, learning to code
- Javascript Discussions & Code
- Linux Related
- Mac, iPhone & OS-X Stuff
- Miscellaneous
- MS Windows Related
- PERL & Python Related
- PHP: Questions & Discussion
- PHP: Techniques, Classes & Examples
- Regular Expressions
- Uncategorized Threads



how to optimize mysql databases for very huge databases. 


i decided to try again with more rows, 170k this time. i don't expect to have to prune many rows, but i'd like to be prepared for anything.
maybe it's time i look into cubes

