The Cache: Technology Expert's Forum
 
*
Welcome, Guest. Please login or register. September 18, 2019, 12:47:19 PM

Login with username, password and session length


Pages: [1]
  Print  
Author Topic: IP Address Query Preformance in MySQL - Num vs Str  (Read 6757 times)
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« on: June 13, 2007, 01:07:54 PM »

Nutballs asserted on the ASP board that a numeric index would be faster than a string index – particularly for looking up IP addresses. His findings as relates to actual practice notwithstanding, I wanted to do a pure test of the indicies in MySQL. So I took my list of 500,000 pseudo addresses from my PHP Extension in C thread and added them to a table that looked like this:
CREATE TABLE ipseektest' (
  'ipstring' varchar(16) NOT NULL,
  'ipnum' bigint(20) NOT NULL,
  KEY 'ipstring ('ipstring'),
  KEY 'ipnum' ('ipnum')
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

...and then did an insert of all 500,000 records as both a string and a numeric. Then I wrote code to load up the addresses, shuffle them, and take the first 100,000 and do a select * from … query against both the string value and the numeric value. I made sure that the queries were against identical IP addresses just to ensure that small differences in index position would not be a factor. The code for the test looks like this:
Code:
<?php

require_once('/www/sites/lib/classes/class.dbconnection.php');
$start mtime();

print 
"\n\n\n";

$rawBuff file_get_contents('./addresses.imp');
elapsed("Load Imploded");
$ipStrings explode(chr(10), $rawBuff);
elapsed("Process Imploded - count=" count($ipStrings));

$db = new dbConnection('127.0.0.1''testuser''testpass''test');

print 
"Preparing array\n";
shuffle($ipStrings);
$ipNums array_map("ip2num"$ipStrings);
elapsed("Complete");

print 
"starting string query\n";
for (
$i=0$i<100000$i++)
{
$db->query("select * from ipseektest where ipstring='{$ipStrings[$i]}'");
if ($i 1000 == 0) echo '.';
}
elapsed('Complete');

print 
"starting numeric query\n";
for (
$i=0$i<100000$i++)
{
$db->query("select * from ipseektest where ipnum={$ipNums[$i]}");
if ($i 1000 == 0) echo '.';
}
elapsed('Complete');


function 
elapsed($msg)
{
global $start;
$elap mtime() - $start;
echo "$msg$elap\n";
$start mtime();
}
function 
mtime()
{
list($usec$sec) = explode(' 'microtime());
return ((float)$usec + (float)$sec);
}

?>


Note that I use my own dbConnection class – which is a simple abstraction layer and in the case of the query, the method calls is going directly against the PHP mysql functions so there is no impact at all from my class.

The results were pretty clear:

Load Imploded: 0.0285549163818
Process Imploded - count=500000: 0.310570001602
Preparing array
Complete: 1.33451318741
starting string query
. . . . . . . . . . . . [clipped for readability]
Complete: 35.0995740891

starting numeric query
. . . . . . . . . . . . [clipped for readability]
Complete: 24.0895760059


The numeric query only took 68.6% time that the string query did – a benefit of almost a full third in performance.

In MySQL I had to use a BigInt type which is 20 bytes since a normal Int was not big enough. This leads to a net increase in storage of 8 Megs per million records (4 bytes per record and 4 bytes for the index) which, in the grand scheme of things is rather trivial IMO.

So – Perks' new plan: Use my custom ip2num and num2ip functions to do conversions and store IP addresses as numerics.

Thanks NBs!
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
StephenBauer
Rookie
**
Offline Offline

Posts: 36


View Profile
« Reply #1 on: June 13, 2007, 02:59:45 PM »


What if you use the four byte sized columns as per NBs best performance spec.  Primary keys in MySQL are clustered, no?  A speed and DB sizing benefit in one.  Throw in the binary shifting and you may have the best method possible.  Cannot recall off the top of my head if PHP does binary shifting as an operator...probably does.  I know C derivitive (sp?) languages support it.

SB
Logged
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #2 on: June 13, 2007, 03:04:40 PM »

What if you use the four byte sized columns as per NBs best performance spec.  Primary keys in MySQL are clustered, no?  A speed and DB sizing benefit in one. 
Was going to play with that one, but PinkHat is on me because I've had too much fun for the last 2 days on this schtuff and have managed to avoid *actual* work...  Wink it's on the list though...

Throw in the binary shifting and you may have the best method possible.  Cannot recall off the top of my head if PHP does binary shifting as an operator...probably does.  I know C derivitive (sp?) languages support it.
PHP does have bitwise operators ie., $c = $a << $b would say $c gets the value of $a, shifted left $b bits. Great for multiplications and division... but I'm afraid I'm not getting you here... please clarify so I don't look like a dork...   Nerd
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
StephenBauer
Rookie
**
Offline Offline

Posts: 36


View Profile
« Reply #3 on: June 13, 2007, 03:22:09 PM »


I wasn't thinking too "specific" when I was typing that last reply...the "shifting" was in relation to this part of NB's code (all the *256 stuff):

      num=cint(str1)*256*256*256+cint(str2)*256*256+cint(str3)*256+cint(str4)

But that wouldn't play into the code we're talking about.  Smiley  Sorry.  Sometimes you just start typing away!

SB
Logged
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #4 on: June 13, 2007, 03:35:43 PM »

Gotcha, makes sense - thanks!
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs
Administrator
Lifer
*****
Offline Offline

Posts: 5627


Back in my day we had 9 planets


View Profile
« Reply #5 on: June 13, 2007, 04:22:47 PM »

read it at syndk8 first, but like SB suggests, when you have time, try out the 4 tinyint column method. I have a feeling that will smoke the ip2num method. especially because it is just a split instead of some semi-complex math.
Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #6 on: June 13, 2007, 04:36:42 PM »

when you have time, try out the 4 tinyint column method. I have a feeling that will smoke the ip2num method. especially because it is just a split instead of some semi-complex math.

 Huh? Complex math? the ip2num method is about as uncalculated as can be... and I'm looking forward to see if an index on 4 columns is actually faster than a single numeric index... I'm gonna bet a shiny new nickel that the numeric version is the fastest and the 4 column is a rough tie.

I'll let you know when I get a moment on it...
/p
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs
Administrator
Lifer
*****
Offline Offline

Posts: 5627


Back in my day we had 9 planets


View Profile
« Reply #7 on: June 13, 2007, 05:21:38 PM »

I guess semi-complex was the wrong way to put it.

Multi-step math. At the database level, i also am guessing a tie, but...

the real use of this would be real IPs that would have to be converted to be looked up, so whatever system you use to do the conversion, would be the bottleneck. Im guessing a string split with 4 resulting parts of a string max length of 15 characters is going to be faster than that multi-step math shit thats going on in the IP2num conversion. but i really dont know, other than my test which resulted in 4column smokin everything else.

curious about the pure DB level though.
Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #8 on: June 13, 2007, 05:34:04 PM »

And the results are in!

In my final test, I changed the database thusly:
CREATE TABLE `ipseektest` (
  `ipstring` varchar(16) NOT NULL,
  `ipnum` bigint(20) NOT NULL,
  `tiny1` smallint(6) NOT NULL,
  `tiny2` smallint(6) NOT NULL,
  `tiny3` smallint(6) NOT NULL,
  `tiny4` smallint(6) NOT NULL,
  KEY `ipstring` (`ipstring`),
  KEY `ipnum` (`ipnum`),
  KEY `tiny1` (`tiny1`,`tiny2`,`tiny3`,`tiny4`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


Note the addition of 4 new fields, tiny1-4 each of which are SMALLINTs. I could not use a TINYINT in MySQL because that is a signed 4 bytes – which should be plenty large enough, but I could only put -127 – 127 in it. Very peculiar. A Small is only 6 bytes, so the difference was pretty small.

Then I took the exact same addresses and placed them into the octet fields of the table ie., if the ipstring was 127.0.0.1, then the ipnum is 127000000001 and tiny1 is 127, tiny2 is 0 tiny3 is 0 and tiny4 is 1. I then indexed as you can see above.

The code I ran is as follows:
Code:
#! /usr/local/bin/php
<?php

require_once('/www/sites/lib/classes/class.dbconnection.php');
$start mtime();

print 
"\n\n\n";


// JUST GETTING THE FILE OF ADDRESSES
$rawBuff file_get_contents('./addresses.imp');
elapsed("Load Imploded");
$ipStrings explode(chr(10), $rawBuff);
elapsed("Process Imploded - count=" count($ipStrings));

$db = new dbConnection('127.0.0.1''testuser''testpass''test');


// RANDOMIZING AND CREATING A LOOKUP LIST
print "Preparing array\n";
shuffle($ipStrings);
$ints = array();
$ptr 0;
foreach(
$ipStrings as $address)
{
preg_match('/([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})/'$address$octets);
$ints[$ptr][0] = $octets[1];
$ints[$ptr][1] = $octets[2];
$ints[$ptr][2] = $octets[3];
$ints[$ptr][3] = $octets[4];
$ptr++;
if ($ptr 100000) { break; }
}
elapsed("Complete");


// HERE'S THE JUICE
print "starting string query\n";
for (
$i=0$i<100000$i++)
{
// Line broken here for readability - wasn't in the real code
$db->query("select * from ipseektest where tiny1={$ints[$i][0]} and tiny2={$ints[$i][1]} and 
tiny3=
{$ints[$i][2]} and tiny4={$ints[$i][3]}");

if ($i 1000 == 0) echo '.';
}
elapsed('Complete');




function 
elapsed($msg)
{
global $start;
$elap mtime() - $start;
echo "$msg$elap\n";
$start mtime();
}
function 
mtime()
{
list($usec$sec) = explode(' 'microtime());
return ((float)$usec + (float)$sec);
}

?>


And the result is…

Load Imploded: 0.0286769866943
Process Imploded - count=500000: 0.307203054428
Preparing array
Complete: 1.24323010445
starting string query
. . . . . . . . . [ clipped for readability ]
Complete: 34.0186672211

Just a smidge faster than the string query – about 3.1% faster – hard to make a case that it's worth the effort or extra storage space (we’re now up to 16Megs per million records for storage).

Your point in the other thread is really important: how is this used in the real world? That's why I pushed after the C conversion solution - there just isn't a pretty and scripted way to handle that conversion really efficiently. Using the C addon to PHP we can eliminate that bottleneck and busy work and let the DB get at what it does best, IMO.

/p
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs
Administrator
Lifer
*****
Offline Offline

Posts: 5627


Back in my day we had 9 planets


View Profile
« Reply #9 on: June 13, 2007, 06:08:46 PM »

you should be able to use tinyint in mysql, just set it to unsigned?

In MSsql, tiny int is unsigned only, 1 byte. So space wise it would be a total of 4, making it half of a single bigint. string would be 15bytes. padded number though is INT since 255255255255 is well under the 2.1 billion of INT. so thats 4 bytes also. bigint since its 255billion, so thats 8 bytes.

Obviously the indexing would add space as well, but im not sure how much for tinyint if it would even make any difference, which i dont htink it would.

The other factor is readability. Although I know your trying to find the fastest result, regardless of all else, database readability suffers greatly with converted columns. And if the speed difference is 0, i always go with the more human-readable method. but again, thats personal methodology.

just to make sure, you ran it a couple of times right? just to make sure there was no change in serverload. I ran mine all in 1 single pass, each type 1 right after the other, and got a few spikes. but thats because i was running it on a live server. just making sure.

edit:was being retarded with my decimals...
« Last Edit: June 13, 2007, 07:46:13 PM by nutballs » Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
nutballs
Administrator
Lifer
*****
Offline Offline

Posts: 5627


Back in my day we had 9 planets


View Profile
« Reply #10 on: June 13, 2007, 06:10:16 PM »

BTW, regarding Indexing. 2 lookups are required for indexing obviously, so... im guessing Un-Indexed, tiny will win, possibly even win overall.
Logged

I could eat a bowl of Alphabet Soup and shit a better argument than that.
StephenBauer
Rookie
**
Offline Offline

Posts: 36


View Profile
« Reply #11 on: June 13, 2007, 06:14:19 PM »

I think mySQL primary keys are inherently (sp?) clustered, no?  And that is default behaviour for SQL Server primary keys too.  Although you can make it non-clustered and create a clustered index later that isn't necessarily the primary key (at least in SQL Server).  Clustered means the data is stored right there in the clustered index so it makes for quick access of the indexed data and you do not get any extra overhead (cpu time or storage needs) for (seperate, non-clustered) index storage.  The (referenced) data is the clustered index or vice versa.

SB

Added:  This knowledge isn't directed at perk or NB in particular as they probably know this but those following along that do not...
« Last Edit: June 13, 2007, 08:00:43 PM by StephenBauer » Logged
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #12 on: June 13, 2007, 06:47:16 PM »

The other factor is readability. Although I know your trying to find the fastest result, regardless of all else, database readability suffers greatly with converted columns. And if the speed difference is 0, i always go with the more human-readable method. but again, thats personal methodology.

just to make sure, you ran it a couple of times right? just to make sure there was no change in serverload. I ran mine all in 1 single pass, each type 1 right after the other, and got a few spikes. but thats because i was running it on a live server. just making sure.

@ Readability - I suppose an important point to make here is that I am not yet suggesting a recommended protocol... I'm just on the hunt for speed right now cause you piqued my interested you bastard. I agree that looking at numbers in the 100 billion range begins to look a bit weird... but if I don't spend too much time looking at them and the speed difference is in the 10x zone, it may be worth it.

@ multipass - yup - a whole bunch of times. About 10 times, the ones I posted were nicely representative of the average.
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
galileo
n00b
*
Offline Offline

Posts: 4


View Profile
« Reply #13 on: July 20, 2007, 03:13:02 AM »


May be of some help:

http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html

#INET_ATON(expr)

Given the dotted-quad representation of a network address as a string, returns an integer that represents the numeric value of the address. Addresses may be 4- or 8-byte addresses.

mysql> SELECT INET_ATON('209.207.224.40');
        -> 3520061480

The generated number is always in network byte order. For the example just shown, the number is calculated as 209×2563 + 207×2562 + 224×256 + 40.

INET_ATON() also understands short-form IP addresses:

mysql> SELECT INET_ATON('127.0.0.1'), INET_ATON('127.1');
        -> 2130706433, 2130706433

Note: When storing values generated by INET_ATON(), it is recommended that you use an INT UNSIGNED column. If you use a (signed) INT column, values corresponding to IP addresses for which the first octet is greater than 127 cannot be stored correctly. See Section 11.2, “Numeric Types”.

#INET_NTOA(expr)

Given a numeric network address (4 or 8 byte), returns the dotted-quad representation of the address as a string.

mysql> SELECT INET_NTOA(3520061480);
        -> '209.207.224.40'

Logged

No links in signatures please
Pages: [1]
  Print  
 
Jump to:  

Perkiset's Place Home   Best of The Cache   phpMyIDE: MySQL Stored Procedures, Functions & Triggers
Politics @ Perkiset's   Pinkhat's Perspective   
cache
mart
coder
programmers
ajax
php
javascript
Powered by MySQL Powered by PHP Powered by SMF 1.1.2 | SMF © 2006-2007, Simple Machines LLC
Seo4Smf v0.2 © Webmaster's Talks


Valid XHTML 1.0! Valid CSS!