Indica

i've now run into my first major database roadblock: how to optimize mysql for very large databases.

i've got a table with 18.8mil rows of user ids. i plan to check these user ids to see if they're online or not, and will be doing this from several servers, so i need a php script to send each server a few names to check. i suppose before i explain the script and the bottleneck i should give the structure:

CREATE TABLE `userids` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `userid` varchar(16) NOT NULL,
  `last_check_date` datetime NOT NULL,
  `sent_date` datetime NOT NULL,
  `offline_count` tinyint(4) NOT NULL COMMENT 'number of times sn was offline',
  `status` varchar(1) NOT NULL default '0' COMMENT '0 = ready, 1 = sent for checking, 2 = checked, 3 = inactive',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `sn` (`userid`),
  KEY `last_check_date` (`last_check_date`,`sent_date`,`offline_count`,`status`),
  FULLTEXT KEY `status` (`status`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=18806132 ;


to sum this up, last_check_date is the date the user's online status was last checked, and sent_date is the time that userid was sent to one of my servers. sent_date is used in combination with status (1) for pruning in the event that my server doesn't send back the online status for whatever reason. and this causes the problem. i ran my php script to send 5 user ids to my other server, so the first 5 rows in this database had status values of 1 and had a sent_date set. now my prune time is set to 5 minutes, so it should reset user ids that haven't gotten a response in 5 minutes back to 0. below is the sql:

UPDATE `userids` SET status=0 WHERE status LIKE '%1%' AND sent_date <= '2007-11-03 01:34:05'


i've just run this command and the result: Affected rows: 5 (Query took 59.0590 sec)

for a php script that is to be run every 5 minutes, this isn't pretty.

i know there are some sql gurus around here, can anyone throw me a few bones about how to optimize this db and/or the query? i think it's the sent_date causing things to slow down a ton, but i'm not 100% sure. any tips are appreciated

thedarkness

Something like;

EXPLAIN SELECT status FROM `userids` WHERE status LIKE '%1%' AND sent_date <= '2007-11-03 01:34:05'


Might be a good place to start, post the results here to get the discussion rolling.

HTH,
td

thedarkness

Holy shit, just noticed status is a varchar(1)... wtf for?

Change it to a tinyint(1) and use status = 1 and I would be surprised if you didn't see an immediate and considerable improvement in query performance.

Always favour queries based on numeric data types if you can.
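
Something like this should do it (untested sketch; note the FULLTEXT index on status would have to go first, since FULLTEXT can't sit on a numeric column):

-- untested sketch: the FULLTEXT index must go before the column can become numeric
ALTER TABLE userids DROP INDEX `status`;

-- convert the flag to a numeric type
ALTER TABLE userids
  MODIFY `status` TINYINT(1) NOT NULL DEFAULT 0
  COMMENT '0 = ready, 1 = sent for checking, 2 = checked, 3 = inactive';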

You may also want to look at the InnoDB engine as opposed to MyISAM. Test, test, test.

Post the results of the explain here anyway and don't assume I am the be-all and end-all of database optimizers

My 2c,
td

[edit]Sorry about the double post[/edit]

perkiset

Also, the status like '%1%' is just a huge no-no - you essentially killed any index that could have helped you.

First, TD is right - either make that a tinyint or a CHAR(1). Then build an index on status plus the id (PK). This will make the cardinality of the index perfect.

Then do your update as WHERE status = '1' (or = 1) and it will rock.
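
A sketch of what I mean (the index name is just illustrative):

-- composite index: status first, then the PK
CREATE INDEX status_id ON userids (status, id);

-- the prune then becomes a straight equality match
UPDATE userids SET status = 0
WHERE status = 1 AND sent_date <= '2007-11-03 01:34:05';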

/p

nutballs

agreed with the others:
change status to tinyint
get rid of the LIKE: WHERE status=1

that will eliminate the fulltext index on status
you then also don't need to key all those things, since both offline_count and status are ints. (mysql auto-indexes numeric types, right? MSSQL does, but I can't remember if mySQL does.) I think the dates are also auto-indexed since they're just numeric anyway, technically. so I don't think you need to actually key/index those columns at all. Unless mySQL doesn't auto-index, then never mind

perkiset

MySQL does not autoindex anything as far as I know, you must explicitly create them. And actually it's not a fulltext index that is created but rather a regex of sorts that is applied to every row in the table for evaluation, which is why it's taking so freaking long. By using the status+id index it will be trivially fast.
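
A quick way to see the difference (illustrative):

-- leading-wildcard LIKE: no index possible, every row gets pattern-matched
EXPLAIN SELECT id FROM userids WHERE status LIKE '%1%';

-- equality: free to use an index on status
EXPLAIN SELECT id FROM userids WHERE status = 1;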

Indica

late night (early morning?) coding will do this to you

i've changed the type and the indexes, and speed seems to have gone up a little. i tested the db with 125,000 rows set to be pruned; the result:

Affected rows: 125025 (Query took 61.8150 sec)


i've refilled the db with the 125,000 test rows and here's the EXPLAIN results:

EXPLAIN SELECT STATUS FROM `userids`
WHERE STATUS =1
AND sent_date <= '2007-11-03 01:34:05'

id  select_type  table    type   possible_keys  key              key_len  ref   rows      Extra
1   SIMPLE       userids  index  NULL           last_check_date  18       NULL  18806131  Using where; Using index

nutballs

Quote from: perkiset

MySQL does not autoindex anything as far as I know, you must explicitly create them.

ah ok, good to know. lol

perkiset

... it took 61 seconds to do 125K updates? That still seems a bit high... that's only about 2K/sec... maybe I'm just delirious from doing the bills.

Indica

yeah, that's correct perk. i decided to try again with more rows, 170k this time. i don't expect to have to prune many rows, but i'd like to be prepared for anything.

i can select the rows very quickly, but updating kills it. for some reason phpMyAdmin only says it takes under a second to complete, but my browser certainly takes ~10 seconds to complete, so i think phpMyAdmin is lying

SELECT id FROM `userids` WHERE STATUS =1 AND sent_date <= '2007-11-03 17:58:44'
Showing rows 0 - 29 (170,750 total, Query took 0.0021 sec)


and now the update..

UPDATE `userids` SET status=0 WHERE status=1 AND sent_date <= '2007-11-03 17:58:44'
Affected rows: 170750 (Query took 59.5929 sec)


i do not plan to add any more rows to this database, and i plan to move the ones that were found to be online to another table to hopefully speed things up. i think the next test would be going to InnoDB.
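
the move would be something like this (sketch; `userids_online` is just a placeholder name):

-- shift ids that have come back from a check out of the hot table
-- (assuming status = 2 marks a completed check)
INSERT INTO userids_online
SELECT * FROM userids WHERE status = 2;

DELETE FROM userids WHERE status = 2;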

i can't imagine how long it would take to data mine 200mil records of user information. maybe it's time i look into cubes

perkiset

Too bad the DBDog has not been around... he'd have a suggestion. Sorry man, I'm about fresh out of ideas on that one.

thedarkness


id  select_type  table    type   possible_keys  key              key_len  ref   rows      Extra
1   SIMPLE       userids  index  NULL           last_check_date  18       NULL  18806131  Using where; Using index


Why is last_check_date even mentioned here?

Why is possible_keys NULL?

Why is ref NULL?

This has exceeded my knowledge but something looks fishy here, to me at least.

Indica, try creating an index on sent_date

Cheers,
td

[edit] I really need to learn more about explain. Let's start a thread purely for information on interpreting what explain is trying to tell you? [/edit]

thedarkness

Sorry for being a serial double poster in this thread :-D

Indica: Try the select with "EXPLAIN EXTENDED" prepended and post the results here, I'm really interested in this so take advantage while I'm "hot"

Dangers of a short attention span

Cheers,
td

[edit]
After you issue the "EXPLAIN EXTENDED SELECT....." statement you need to issue a "SHOW WARNINGS" command to get the output?

From; http://dev.mysql.com/doc/refman/5.1/en/explain.html

"When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS  statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process. EXPLAIN EXTENDED also displays the filtered  column as of MySQL 5.1.12."

you may also need to issue an "ANALYZE TABLE" to update cardinality, etc. if you make changes.

Interestingly the possible_keys section states;
" If this column is NULL, there are no relevant indexes. In this case, you may be able to improve the performance of your query by examining the WHERE clause to check whether it refers to some column or columns that would be suitable for indexing. If so, create an appropriate index and check the query with EXPLAIN again. See Section 12.1.4, “ALTER TABLE Syntax”.

To see what indexes a table has, use SHOW INDEX FROM tbl_name. "

Which would tend to indicate that your query cannot find a suitable index to use, and this I would say is critical...
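
So the whole sequence would look something like this (sketch):

-- refresh index statistics after any schema/index changes
ANALYZE TABLE userids;

-- see which indexes the optimizer can actually use
SHOW INDEX FROM userids;

-- re-run the explain and pull the optimizer's notes
EXPLAIN EXTENDED SELECT status FROM userids
WHERE status = 1 AND sent_date <= '2007-11-03 01:34:05';

SHOW WARNINGS;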

HTH
[/edit]

thedarkness

haha, i kill me

Also read, http://dev.mysql.com/tech-resources/articles/performance-partitioning.html and http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning.html

Looks like partitioning may present a possible (substantial) performance increase if we can work out a good way to lay it out.

Post the "EXPLAIN EXTENDED" output first and let's get that as optimized as possible before we go too far down the partition road.

Told you I was "hot"

Cheers,
td

Indica

Quote from: thedarkness

Why is last_check_date even mentioned here?

that's the name of the index, phpMyAdmin did this

Quote from: thedarkness

Why is possible_keys NULL?

Why is ref NULL?

good questions and i'm not sure. never had to tweak a db like this before, most behave the way they should

sent_date has an index, it's under last_check_date along with last_check_date, sent_date, offline_count, and status. when i created the table i think phpMyAdmin clumped them all together; whether they're supposed to be like this or on their own i'm not sure.

here's explain extended:
EXPLAIN EXTENDED SELECT STATUS FROM `userids`
WHERE STATUS =1
AND sent_date <= '2007-11-03 01:34:05'

id  select_type  table    type   possible_keys  key              key_len  ref   rows      Extra
1   SIMPLE       userids  index  NULL           last_check_date  18       NULL  18806131  Using where; Using index


SHOW WARNINGS showed nothing. so i then did the analyze

Table           Op       Msg_type  Msg_text
oldids.userids  analyze  status    OK


and show index:

Table    Non_unique  Key_name         Seq_in_index  Column_name      Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment
userids  0           PRIMARY          1             id               A          18806131     NULL      NULL          BTREE
userids  0           sn               1             userid           A          18806131     NULL      NULL          BTREE
userids  1           last_check_date  1             last_check_date  A          1            NULL      NULL          BTREE
userids  1           last_check_date  2             sent_date        A          934          NULL      NULL          BTREE
userids  1           last_check_date  3             offline_count    A          934          NULL      NULL          BTREE
userids  1           last_check_date  4             status           A          934          NULL      NULL          BTREE
userids  1           id               1             id               A          18806131     NULL      NULL          BTREE
userids  1           id               2             status           A          18806131     NULL      NULL          BTREE


there's already an index on both status and sent_date?

as for partitioning, yeah i've considered this. i was going to make a table for the 'a' usernames, 'b' usernames, etc. i would prefer not to as i'm going to use a similar system for another project, but if it's the way it has to be i guess that's it. would be nice not to have to reimport this entire thing, damn db took 4 days to import as it is

if none of this improves speed, i'll try switching to innodb. and if that is still shit, i'll try partitioning. thanks for the help thus far

thedarkness

Some observations and things to try (these examples are from a famous mysql test database called sakila);

CREATE TABLE `actor` (
  `actor_id` smallint(5) unsigned NOT NULL auto_increment,
  `first_name` varchar(45) NOT NULL,
  `last_name` varchar(45) NOT NULL,
  `last_update` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`actor_id`),
  KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8


mysql> show index from actor;
+-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name            | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| actor |          0 | PRIMARY             |            1 | actor_id    | A         |         200 |     NULL | NULL   |      | BTREE      |         |
| actor |          1 | idx_actor_last_name |            1 | last_name   | A         |         200 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+---------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
2 rows in set (0.00 sec)


mysql> explain select first_name from actor where actor_id < 11 and last_update < '2006-02-15 04:34:34';
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | actor | range | PRIMARY       | PRIMARY | 2       | NULL |   10 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)


These results look more like what I would expect than yours do.

Actually, just looking at this, it looks to me like last_check_date is a multiple-column index covering last_check_date, sent_date, offline_count and status, and there's also one composed of id and status. So I'd say that, as explain was trying to tell us, there was no suitable index for the query.

Try this;

create index stat on userids (status); // may use, may improve speed

create index sent on userids (sent_date); // may use, may improve speed, may use in conjunction with stat?

create index stat_sent on userids (status,sent_date); // Order important?

create index sent_stat on userids (sent_date,status); // Looks like the winner, test them all though

Now you may have to try each one individually (not sure here), so create one and delete the others, then test with explain to see whether it's using an index and what your speed is like. If you throw all the indexes in there together it may be smart enough to pick the best one (prolly is), but it may just muddy the waters, so I'd do each in isolation and then all together to see if it uses more than one etc.
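
i.e. one round would look like this (sketch):

-- round 1: test the single-column status index in isolation
CREATE INDEX stat ON userids (status);
ANALYZE TABLE userids;

EXPLAIN SELECT status FROM userids
WHERE status = 1 AND sent_date <= '2007-11-03 01:34:05';

DROP INDEX stat ON userids;
-- repeat with sent, stat_sent, sent_stat in turn, timing the real UPDATE each round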

Given that we are creating these indexes explicitly on the columns the queries are based on they should help a lot.

Oh, and use analyze (help analyze table) and explain a lot, and maybe consider optimize (help optimize table) if you suspect you have "holes" in your data.

Report back dude.

Cheers,
td

thedarkness

Don't leave us hanging Indica, what's the scoop?

Cheers,
td

Indica

i'm going to work on this database some more this weekend. very busy week, i haven't forgotten.

i've got a feeling i will have to partition it. i'd rather not do this because of future projects that will be on a similar scale; they won't partition well due to the nature of the data. and if i can't get an 18mil record db going smooth(er), i don't want to think about having 200mil records with gbs of data

nop_90

I had a similar problem when i made a program that handled newspaper subscribers, there were like 100K of them.
Subscribers could stop and start daily.
The program had to run on a 486.
The problem is the updates. That is what killed me.

The db schema for each user was like this:
route number, subscriber id, name, etc.
and it was all in one table.
there were approx 500 routes (i can not remember details).

i then made like 500 tables with the same schema as above, but with each table only having users of the same route.

I made a class that would handle all the housekeeping (proggie was in python, and this was before sqlobject was invented).
that way when a new user was added etc it would go to the appropriate route table.

Also the class on the surface made it appear as if it was all in 1 table.

Indica

well, i am now partitioning the large table up into tables for each letter. will give an update once things are finished importing.
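
the layout is roughly this (sketch; the per-letter table names are just what i'm using):

-- one table per leading letter, same schema and indexes as the original
CREATE TABLE userids_a LIKE userids;

INSERT INTO userids_a
SELECT * FROM userids
WHERE userid LIKE 'a%';  -- no leading wildcard, so the index on userid applies

-- repeat for b..z; the php side picks the table from the first letter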

thedarkness

Did you try creating those indexes and testing them as per my large post on this page, Indica? All evidence would suggest that that query was running without using an index; if you can get it to use a suitable index I would expect it to make a hell of a difference...

Cheers,
td

chide

Don't bother with partitioning... I've had databases with 300mil rows and it didn't make any substantial difference.

One major thing to look at though:  how big is your key buffer?  If you're playing with tables in the tens of millions you'll want at least 256mb.  If it's an InnoDB table (which I'd use for a table that big) make sure to set InnoDB's buffer pool to at least 256mb.  The server should have at least a gig of ram in it.

More memory makes everything faster.
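
To see where you're at (sketch; the persistent settings live in my.cnf under [mysqld]):

-- current sizes, in bytes
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- key_buffer_size can be raised on the fly; the InnoDB pool needs a
-- my.cnf entry (e.g. innodb_buffer_pool_size = 256M) and a restart on 5.0
SET GLOBAL key_buffer_size = 268435456;  -- 256MB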

nutballs

i'd agree. In MSSQL i did a database that processed 1-2 gigs of weblogs a day. No partitioning.

In my experience, proper organization of the tables and indexes can be just as effective as partitioning, without the coding legwork to deal with it.

Though of course in MSSQL there is auto-partitioning (i can't remember what the hell it's called for some reason), so you can do that if push comes to shove. I assume MySQL doesn't do that though.

Indica

Quote from: chide

Don't bother with partitioning... I've had databases with 300mil rows and it didn't make any substantial difference.

One major thing to look at though:  how big is your key buffer?  If you're playing with tables in the tens of millions you'll want at least 256mb.  If it's an InnoDB table (which I'd use for a table that big) make sure to set InnoDB's buffer pool to at least 256mb.  The server should have at least a gig of ram in it.

More memory makes everything faster.


the server this db is on hasn't got much ram. i've been testing on my own test server which does have 1gb ram. partitioning seems to have helped: a prune of 125k userids took 10 seconds, where before it would take 60. i will try increasing the buffer and switching to innodb; if no increase is seen with this new change, well, 10 seconds is fine.

it seems my version of mysql doesn't have innodb as a storage engine. i'll have to look into installing it.
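
a quick way to check is (sketch):

-- 'YES' means compiled in and enabled; 'DISABLED' usually means a
-- skip-innodb line in my.cnf, which some xampp builds ship with
SHOW VARIABLES LIKE 'have_innodb';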

chide

What version of mysql are you using?  Version 5+ comes with InnoDB standard.  I wouldn't use older stuff anyway... they've made a ton of improvements in the last few years.

thedarkness

Get your indexes right or it will come back to bite you.

Cheers,
td

perkiset

Quote from: thedarkness

Get your indexes right or it will come back to bite you.


... and with the size of that database, it'll be a great white bite, not a mosquito

Indica

am using mysql 5.0.20, xampp setup.

i'll fix the indexes, before they bite my leg off

nop_90

The size of the db does not matter when it comes to queries (or matters very little).
What kills u is when u have to update.
Or that was the case on mysql 3.

When u update the table it has to lock and then unlock.

chide

InnoDB does row-level locking so it's not an issue anymore if you use that table format.

perkiset

InnoDB, however, cuts into performance pretty hard. The largest reason for InnoDB is if you need transactions. If you are only doing relatively simple DB work, then the default handler is fine. And the table-locking issue only really becomes an issue if you have a high-thrash database where lots of processes are trying to update simultaneously. The lock/release cycle is actually really quick, but it does hold up all other processes from writing. Lock/release does not affect reading the table, however, so if you have a high-read/low-write application, the defaults are *very* adequate.
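
If you do want to test it, the conversion itself is one statement (sketch; it rewrites the entire table, so expect it to grind for a while on 18mil rows):

-- full table rebuild into the new engine
ALTER TABLE userids ENGINE=InnoDB;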

chide

That used to be the case... InnoDB has gotten a LOT better though:
http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/

Though YMMV, of course.

Indica

Quote from: perkiset

If you are only doing relatively simple DB work, then the default handler is fine. And the table-locking issue only really becomes an issue if you have a high-thrash database where lots of processes are trying to update simultaneously.


the majority of the queries are just selecting data and updating one field, nothing too complex. just a lot of selects/updates.

chide

Either way... give it LOTS of cache (but not so much that the OS starts swapping it to disk)

