perkiset

Nutballs asserted on the

ASP

  board that a numeric index would be faster than a string index – particularly for looking up IP addresses. His findings as relates to actual practice notwithstanding, I wanted to do a pure test of the indicies in MySQL. So I took my list of 500,000 pseudo addresses from my

PHP

  Extension in C thread and added them to a table that looked like this:
CREATE TABLE ipseektest' (
  'ipstring' varchar(16) NOT NULL,
  'ipnum' bigint(20) NOT NULL,
  KEY 'ipstring ('ipstring'),
  KEY 'ipnum' ('ipnum')
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
[/pre]
...and then did an insert of all 500,000 records as both a string and a numeric. Then I wrote code to load up the addresses, shuffle them, and take the first 100,000 and do a select * from … query against both the string value and the numeric value. I made sure that the queries were against identical IP addresses just to ensure that small differences in index position would not be a factor. The code for the test looks like this:

<?

php

 

require_once('/www/sites/lib/classes/class.dbconnection.

php

 ');
$start = mtime();

print " ";

$rawBuff = file_get_contents('./addresses.imp');
elapsed("Load Imploded");
$ipStrings = explode(chr(10), $rawBuff);
elapsed("Process Imploded - count=" . count($ipStrings));

$db = new dbConnection('127.0.0.1', 'testuser', 'testpass', 'test');

print "Preparing array ";
shuffle($ipStrings);
$ipNums = array_map("ip2num", $ipStrings);
elapsed("Complete");

print "starting string query ";
for ($i=0; $i<100000; $i++)
{
$db->query("select * from ipseektest where ipstring='{$ipStrings[$i]}'");
if ($i % 1000 == 0) echo '.';
}
elapsed('Complete');

print "starting numeric query ";
for ($i=0; $i<100000; $i++)
{
$db->query("select * from ipseektest where ipnum={$ipNums[$i]}");
if ($i % 1000 == 0) echo '.';
}
elapsed('Complete');


function elapsed($msg)
{
global $start;
$elap = mtime() - $start;
echo "$msg: $elap ";
$start = mtime();
}
function mtime()
{
list($usec, $sec) = explode(' ', microtime());
return ((float)$usec + (float)$sec);
}

?>


Note that I use my own dbConnection class – which is a simple abstraction layer and in the case of the query, the method calls is going directly against the

PHP

  mysql functions so there is no impact at all from my class.

The results were pretty clear:

Load Imploded: 0.0285549163818
Process Imploded - count=500000: 0.310570001602
Preparing array
Complete: 1.33451318741
starting string query
. . . . . . . . . . . . [clipped for readability]
Complete: 35.0995740891

starting numeric query
. . . . . . . . . . . . [clipped for readability]
Complete: 24.0895760059
[/pre]

The numeric query only took 68.6% time that the string query did – a benefit of almost a full third in performance.

In MySQL I had to use a BigInt type which is 20 bytes since a normal Int was not big enough. This leads to a

net

  increase in storage of 8 Megs per million records (4 bytes per record and 4 bytes for the index) which, in the grand scheme of things is rather trivial IMO.

So – Perks' new plan: Use my custom ip2num and num2ip functions to do conversions and store IP addresses as numerics.

Thanks NBs!

StephenBauer


What if you use the four byte sized columns as per NBs best performance spec.  Primary keys in MySQL are clustered, no?  A speed and DB sizing benefit in one.  Throw in the binary shifting and you may have the best method possible.  Cannot recall off the top of my head if

PHP

  does binary shifting as an operator...probably does.  I know C derivitive (sp?) languages support it.

SB

perkiset

quote author=StephenBauer link=topic=314.msg2114#msg2114 date=1181771985

What if you use the four byte sized columns as per NBs best performance spec.  Primary keys in MySQL are clustered, no?  A speed and DB sizing benefit in one. 

Was going to play with that one, but PinkHat is on me because I've had too much fun for the last 2 days on this schtuff and have managed to avoid *actual* work...  Applause it's on the list though...

quote author=StephenBauer link=topic=314.msg2114#msg2114 date=1181771985

Throw in the binary shifting and you may have the best method possible.  Cannot recall off the top of my head if

PHP

  does binary shifting as an operator...probably does.  I know C derivitive (sp?) languages support it.

PHP

  does have bitwise operators ie., $c = $a << $b would say $c gets the value of $a, shifted left $b bits. Great for multiplications and division... but I'm afraid I'm not getting you here... please clarify so I don't look like a dork...   Applause

StephenBauer


I wasn't thinking too "specific" when I was typing that last reply...the "shifting" was in relation to this part of NB's code (all the *256 stuff):

num=cint(str1)*256*256*256+cint(str2)*256*256+cint(str3)*256+cint(str4)

But that wouldn't play into the code we're talking about.  Applause  Sorry.  Sometimes you just start typing away!

SB

perkiset

Gotcha, makes sense - thanks!

nutballs

read it at syndk8 first, but like SB suggests, when you have time, try out the 4 tinyint column method. I have a feeling that will smoke the ip2num method. especially because it is just a split instead of some semi-complex math.

perkiset

quote author=nutballs link=topic=314.msg2119#msg2119 date=1181776967

when you have time, try out the 4 tinyint column method. I have a feeling that will smoke the ip2num method. especially because it is just a split instead of some semi-complex math.


Applause Complex math? the ip2num method is about as uncalculated as can be... and I'm looking forward to see if an index on 4 columns is actually faster than a single numeric index... I'm gonna bet a shiny new nickel that the numeric version is the fastest and the 4 column is a rough tie.

I'll let you know when I get a moment on it...
/p

nutballs

I guess semi-complex was the wrong way to put it.

Multi-step math. At the database level, i also am guessing a tie, but...

the real use of this would be real IPs that would have to be converted to be looked up, so whatever system you use to do the conversion, would be the bottleneck. Im guessing a string split with 4 resulting parts of a string max length of 15 characters is going to be faster than that multi-step math shit thats going on in the IP2num conversion. but i really dont know, other than my test which resulted in 4column smokin everything else.

curious about the pure DB level though.

perkiset

And the results are in!

In my final test, I changed the database thusly:
CREATE TABLE `ipseektest` (
  `ipstring` varchar(16) NOT NULL,
  `ipnum` bigint(20) NOT NULL,
  `tiny1` smallint(6) NOT NULL,
  `tiny2` smallint(6) NOT NULL,
  `tiny3` smallint(6) NOT NULL,
  `tiny4` smallint(6) NOT NULL,
  KEY `ipstring` (`ipstring`),
  KEY `ipnum` (`ipnum`),
  KEY `tiny1` (`tiny1`,`tiny2`,`tiny3`,`tiny4`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
[/pre]

Note the addition of 4 new fields, tiny1-4 each of which are SMALLINTs. I could not use a TINYINT in MySQL because that is a signed 4 bytes – which should be plenty large enough, but I could only put -127 – 127 in it. Very peculiar. A Small is only 6 bytes, so the difference was pretty small.

Then I took the exact same addresses and placed them into the octet fields of the table ie., if the ipstring was 127.0.0.1, then the ipnum is 127000000001 and tiny1 is 127, tiny2 is 0 tiny3 is 0 and tiny4 is 1. I then indexed as you can see above.

The code I ran is as follows:

#! /usr/local/bin/

php

 
<?

php

 

require_once('/www/sites/lib/classes/class.dbconnection.

php

 ');
$start = mtime();

print " ";


// JUST GETTING THE FILE OF ADDRESSES
$rawBuff = file_get_contents('./addresses.imp');
elapsed("Load Imploded");
$ipStrings = explode(chr(10), $rawBuff);
elapsed("Process Imploded - count=" . count($ipStrings));

$db = new dbConnection('127.0.0.1', 'testuser', 'testpass', 'test');


// RANDOMIZING AND CREATING A LOOKUP LIST
print "Preparing array ";
shuffle($ipStrings);
$ints = array();
$ptr = 0;
foreach($ipStrings as $address)
{
preg_match('/([0-9]{1,3}).([0-9]{1,3}).([0-9]{1,3}).([0-9]{1,3})/', $address, $octets);
$ints[$ptr][0] = $octets[1];
$ints[$ptr][1] = $octets[2];
$ints[$ptr][2] = $octets[3];
$ints[$ptr][3] = $octets[4];
$ptr++;
if ($ptr > 100000) { break; }
}
elapsed("Complete");


// HERE'S THE JUICE
print "starting string query ";
for ($i=0; $i<100000; $i++)
{
// Line broken here for readability - wasn't in the real code
$db->query("select * from ipseektest where tiny1={$ints[$i][0]} and tiny2={$ints[$i][1]} and
tiny3={$ints[$i][2]} and tiny4={$ints[$i][3]}");

if ($i % 1000 == 0) echo '.';
}
elapsed('Complete');




function elapsed($msg)
{
global $start;
$elap = mtime() - $start;
echo "$msg: $elap ";
$start = mtime();
}
function mtime()
{
list($usec, $sec) = explode(' ', microtime());
return ((float)$usec + (float)$sec);
}

?>


And the result is…

Load Imploded: 0.0286769866943
Process Imploded - count=500000: 0.307203054428
Preparing array
Complete: 1.24323010445
starting string query
. . . . . . . . . [ clipped for readability ]
Complete: 34.0186672211
[/pre]
Just a smidge faster than the string query – about 3.1% faster – hard to make a case that it's worth the effort or extra storage space (we’re now up to 16Megs per million records for storage).

Your point in the other thread is really important: how is this used in the real world? That's why I pushed after the C conversion solution - there just isn't a pretty and scripted way to handle that conversion really efficiently. Using the C addon to

PHP

  we can eliminate that bottleneck and busy work and let the DB get at what it does best, IMO.

/p

nutballs

you should be able to use tinyint in mysql, just set it to unsigned?

In MSsql, tiny int is unsigned only, 1 byte. So space wise it would be a total of 4, making it half of a single bigint. string would be 15bytes. padded number though is <>INT since 255255255255 is well under the 2.1 billion of INT. so thats 4 bytes also. bigint since its 255billion, so thats 8 bytes.

Obviously the indexing would add space as well, but im not sure how much for tinyint if it would even make any difference, which i dont htink it would.

The other factor is readability. Although I know your trying to find the fastest result, regardless of all else, database readability suffers greatly with converted columns. And if the speed difference is 0, i always go with the more human-readable method. but again, thats personal methodology.

just to make sure, you ran it a couple of times right? just to make sure there was no change in serverload. I ran mine all in 1 single pass, each type 1 right after the other, and got a few spikes. but thats because i was running it on a live server. just making sure.

edit:was being retarded with my decimals...

nutballs

BTW, regarding Indexing. 2 lookups are required for indexing obviously, so... im guessing Un-Indexed, tiny will win, possibly even win overall.

StephenBauer

I think mySQL primary keys are inherently (sp?) clustered, no?  And that is default behaviour for SQL Server primary keys too.  Although you can make it non-clustered and create a clustered index later that isn't necessarily the primary key (at least in SQL Server).  Clustered means the data is stored right there in the clustered index so it makes for quick access of the indexed data and you do not get any extra overhead (cpu time or storage needs) for (seperate, non-clustered) index storage.  The (referenced) data is the clustered index or vice versa.

SB

Added:  This knowledge isn't directed at perk or NB in particular as they probably know this but those following along that do not...

perkiset

quote author=nutballs link=topic=314.msg2129#msg2129 date=1181783326

The other factor is readability. Although I know your trying to find the fastest result, regardless of all else, database readability suffers greatly with converted columns. And if the speed difference is 0, i always go with the more human-readable method. but again, thats personal methodology.

just to make sure, you ran it a couple of times right? just to make sure there was no change in serverload. I ran mine all in 1 single pass, each type 1 right after the other, and got a few spikes. but thats because i was running it on a live server. just making sure.


@ Readability - I suppose an important point to make here is that I am not yet suggesting a recommended protocol... I'm just on the hunt for speed right now cause you piqued my interested you bastard. I agree that looking at numbers in the 100 billion range begins to look a bit weird... but if I don't spend too much time looking at them and the speed difference is in the 10x zone, it may be worth it.

@ multipass - yup - a whole bunch of times. About 10 times, the ones I posted were nicely representative of the average.

galileo


May be of some help:

http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html

#

INET

 _ATON(expr)

Given the dotted-quad representation of a

net

 work address as a string, returns an integer that represents the numeric value of the address. Addresses may be 4- or 8-byte addresses.

mysql> SELECT

INET

 _ATON('209.207.224.40');
        -> 3520061480

The generated number is always in

net

 work byte order. For the example just shown, the number is calculated as 209×2563 + 207×2562 + 224×256 + 40.

INET

 _ATON() also understands short-form IP addresses:

mysql> SELECT

INET

 _ATON('127.0.0.1'),

INET

 _ATON('127.1');
        -> 2130706433, 2130706433

Note: When storing values generated by

INET

 _ATON(), it is recommended that you use an INT UNSIGNED column. If you use a (signed) INT column, values corresponding to IP addresses for which the first octet is greater than 127 cannot be stored correctly. See Section 11.2, “Numeric Types”.

#

INET

 _NTOA(expr)

Given a numeric

net

 work address (4 or 8 byte), returns the dotted-quad representation of the address as a string.

mysql> SELECT

INET

 _NTOA(3520061480);
        -> '209.207.224.40'


Perkiset's Place Home   Politics @ Perkiset's