nutballs

Since the most effective way to store an IP in a database is actually as an Integer (or BigInt technically) this function pair is to convert an IP4 octet to a BigInt and vice versa.

function IP2Num(sip)
dim str1,str2,str3,str4
dim num
IP2Num=0
if isnumeric(left(sip,2)) then
str1=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str2=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str3=left(sip,instr(sip,"."Applause-1)
str4=mid(sip,instr(sip,"."Applause+1)
num=cint(str1)*256*256*256+cint(str2)*256*256+cint(str3)*256+cint(str4)
IP2Num = num
end if
end function


function num2ip(ipn)
'converts from the integer representation of the IP to the quad-octet notation
dim a,b,c,d,tmpnum,ipnum
ipnum=cdbl(ipn)
a=int((ipnum/(256^3))) mod 256
b=int((ipnum/(256^2))) mod 256
c=int((ipnum/256)) mod 256
'due to the limit of MOD in

asp

  to 4byte integers we gotta do some funky math here for the last octet
tmpnum=int(ipnum/2) 'divide by 2 and toss the decimal
tmpnum=tmpnum*2 'multiply back up by 2 to get the "even" value
tmpnum=ipnum-tmpnum 'subtract new number from original to get adjuster: 0 or 1
d=((int((ipnum/2)) mod 12Applause*2)+tmpnum  'funky math should only be:  ipnum mod 256
num2ip=a&"."&b&"."&c&"."&d
end function

StephenBauer


I have been meaning to benchmark that (traditional) method against the other method of just left padding out each octet with zeros and then concatenating them all together to form a single number converting it to BigInt.  You wouldn't have to pad out the first octet either.  May be faster in the "converting to" side but probably slower when "converting from" due to the use of more string functions.

Or am I on crack.

SB

nutballs

quote author=StephenBauer link=topic=72.msg2018#msg2018 date=1181260058


I have been meaning to benchmark that (traditional) method against the other method of just left padding out each octet with zeros and then concatenating them all together to form a single number converting it to BigInt.  You wouldn't have to pad out the first octet either.  May be faster in the "converting to" side but probably slower when "converting from" due to the use of more string functions.

Or am I on crack.

SB



the only problem is the first octet, unless you add a preceding number like 1

then something like 12.34.5.123 could be 1012034005123
never thought of that... that might actually work. max would be 1.255 trillion. i would guess though that the math manipulations would be faster than the extra steps of converting to strings, concating together, the converting back to int. hmmm. might be worth a test.

I have always wondered if there would be an advantage to storing each octet as a tiny int. 4 columns. it would be only 4 bytes, but the lookup would be 4 columns equal query.

StephenBauer

quote author=nutballs link=topic=72.msg2024#msg2024 date=1181268511

quote author=StephenBauer link=topic=72.msg2018#msg2018 date=1181260058


I have been meaning to benchmark that (traditional) method against the other method of just left padding out each octet with zeros and then concatenating them all together to form a single number converting it to BigInt.  You wouldn't have to pad out the first octet either.  May be faster in the "converting to" side but probably slower when "converting from" due to the use of more string functions.

Or am I on crack.

SB



the only problem is the first octet, unless you add a preceding number like 1

then something like 12.34.5.123 could be 1012034005123
never thought of that... that might actually work. max would be 1.255 trillion. i would guess though that the math manipulations would be faster than the extra steps of converting to strings, concating together, the converting back to int. hmmm. might be worth a test.

I have always wondered if there would be an advantage to storing each octet as a tiny int. 4 columns. it would be only 4 bytes, but the lookup would be 4 columns equal query.


Wouldn't need to add anything to the first octet as long as the remaining octets are padded out.  But, yeah, decoding would probably take longer than the other method that is all math.  Encoding would be a toss up with all the string functions in both methods.

The four integers, composite key method is interesting too.  It could have some performance improvements for particular octet, range oriented queries.

SB

StephenBauer

For the original conversion to integer method, all the multiplication could be replaced with left bit shifting to maybe speed things up a bit (by 24, 16, 8, and none per octet, left to right).

SB

StephenBauer


Regex

 .Split could be useful to test to.

No, not obsessed...will be writing some new logging code soon to mate up with some voting code.  Applause

SB

perkiset

Actually I think SB is onto something - but I don't know if you can do it in your language - it's certainly what I miss in

PHP

 .

Just do it C-string style ie., create a string that is 13 characters long, like this: 1000000000000 (the 1 on the end handles the numeric problem you pointed out NBs) then drop the string values of each char of the octets into the appropriate positions in the new string with pointers. No string or numeric manipulation at all. When tearing it back apart, pull the components back out using direct string pointers and drop them into a new string. The loops would be small and about as tight as a loop can be... and since it'd be all pointer stuff with no calculation I should think it would scream.

Bummer I can't try that in

PHP

  - although I've been toying with writing some of my own language extensions for

PHP

  in C++ ... (yeah right... like I have time for that now  :- ) that'd make a killer fun exercise...

/p

StephenBauer


Unless I am brain dead, what is the "numeric problem nutballs pointed out" with the "1"?  The first octet wouldn't need padding unless you employ the method of string pointer substitution you just brought up and then you need something as a placeholder for each possible digit position of each octet (i.e. twelve zeroes).  The remaining three octets, always being padded, ensure the variable digit length of the first octet do not create duplicate numbers...unless I am oversimplifying in my head.

SB

perkiset

Nope you're right... if you pad the B, C & D octets then stringwise you wouldn't care what the A octet was... the bigint resulting number would store just fine in the DB. Wouldn't need the 1 after all...

nutballs

your right. i was over thinking it.

though I habitually add a 1 in front of any string based numbers that i want to guarantee the string length, just in case I am doing something that results in leading zeros.

The situation where this always comes up for me is in reports. you would want all numbers in a column to display the same length, so as not to cause visual distractions in the column. Its more of a display related problem. I didnt think about the fact this will never be displayed, just queried.

nutballs

so now. the logic would be.

split IP into and array of 4 octets
check string length
  if short, add zeros to fill
concat

i really am wondering which would be the fastest?
string 0 padding
math conversion (though in

ASP

  this might now be the worst because of the INT limit of mod)
4 keys in the database (possible because of being such short datatypes)
string pointers, which i think might be essentially the same task as string padding using length checks, since you would still need to know character positions.

My hunch is actually string padding, or 4 keys.

since i do so many IP lookups, this is probably a good thing for refinement.

perkiset

quote author=nutballs link=topic=72.msg2039#msg2039 date=1181318278

split IP into and array of 4 octets
check string length
   if short, add zeros to fill
concat


No no, C string style ie., string arrays not strings - I am shit anymore with C++ on the fly, so here's some pseudo object pascal code so you can read the idea (in object pascal, you can address string positions as an array, much like you address chars in a string in C - although with C I'd use pointers which are faster than arrays) - but also, that was my question above - can you address strings in a C style in your language?


// strings are zero indexed, just like C - so get the actual array position of the last char
inBuff = '192.13.1.34';
inPtr = strlen(inBuff) - 1;

// in C I'd need an actual Chr(0) at the end to terminate this string
outBuff = '000000000000';
// Last char position in the outbuff...
outPtr = 11;

// I'll use this var * 3 in the routine to point to "jump spots" where I can
// reposition the output pointer past any needed padding
octet = 3;

// Loop so long as there is something in the input buffer to copy...
while (inPtr >= 0)
{
// Copy the char at position inPtr into outBuff at position outPtr
outBuff[outPtr--] = inBuff[inPtr--];

if (inBuff[inPtr] == '.')
{
// We've hit a period in the input - reposition the output and jump over the period...
outPtr = octet-- * 3;
inPtr--;
}
}

// At this point, outBuff contains my final string representation of the number...



functionally, this is little more than a reverse StrCpy function, just skipping a bit for the pads and the periods. When you posted it to the database, any leading zeros left on the string would be dumped by the SQL compiler. Taking it apart would pretty much be just as easy.

perkiset

Oh fishit it.

Just downloaded the entire

tutor

 ial on how to write custom extensions for

PHP

  in C++... thanks boys, like I didn't have enough to do  Applause

I'll post a sister thread in

PHP

  referencing this if I ever actually do something with it Applause

StephenBauer


Hehe.  You are a true hack!  Applause

As for the four columns in a database (one per octet), like I said, it may help in range reporting and it also halves your storage from 8 bytes for a long/bigint to 4 bytes for four bytes (duh Applause ).

I think all four methods, each in its own independent loop in the same program, would yield profile results each showing a percentage of run time...whichever is lowest is the quickest.  Of course you will want to add a database store and retrieve as well.

Report back, nutballs!  Applause

SB

nutballs

testing done.
4 loops.
random IPs checked, each of which occur in the database, so no negative hits.
each cycle uses the same ips for each type of query.
100 at a time
1 million in DB
same table for all methods
ipstring is a string lookup
ipnum is my method
ippadded is SBs method
ipssplit is each octet in a separate column
nothing done with query results.

                                                                                            Average
ipstring:        36   31   32   30   31   29   30   31   31.25
ipnum:         31  29 38 33 27 30 27 34 31.125
ippadded: 49 37 38 40 39 38 36 41 39.75
ipsplit:   26   28   26   26   24   24   23   25   25.25


So here is the surprise... As a string, or as a bigint number, the lookup is the same.... wtf? is the MOD math all that difficult? i guess so...
padding is the slowest, sorry SB.
but here is the other surprise. 4 column lookup is the fastest by about 20% over the ipnum method.
Im gonna try making it 10million rows and lookup 1k at a time. see that the numbers stay the same.

Each loop is exactly the same, except for the conversions. here is a code fragment of the important stuff for review.


'======= ipstring
t=now
for i = 0 to ubound(arr)
sql="select top 1 * from ips where ipstring='"&arr(i)&"'"
db.execute sql
next
response.write "ipstring: "&datediff("s",t,now)&"<br>"

'======= ipnum
t=now
for i = 0 to ubound(arr)
sql="select top 1 * from ips where ipnum="&ip2num(arr(i))
db.execute sql
next
response.write "ipnum: "&datediff("s",t,now)&"<br>"

'======= ipPadded
t=now
for i = 0 to ubound(arr)
sql="select top 1 * from ips where ippadded="&padded(arr(i))
db.execute sql
next
response.write "ippadded: "&datediff("s",t,now)&"<br>"

'======= ipsplit
dim A
t=now
for i = 0 to ubound(arr)
a=split(arr(i),"."Applause
sql="select top 1 * from ips where ip1="&a(0)&" and ip2="&a(1)&" and ip3="&a(2)&" and ip4="&a(3)
db.execute sql
next
response.write "ipsplit: "&datediff("s",t,now)&"<br>"


function padded(ip)
dim iparr,ippadded,k
iparr=split(ip,"."Applause
ippadded=Right("00" & CStr(iparr(0)),3)&Right("00" & CStr(iparr(0)),3)&Right("00" & CStr(iparr(0)),3)&Right("00" & CStr(iparr(0)),3)
padded=ippadded
end function

function IP2Num(tsip)
dim str1,str2,str3,str4,sip
dim num
'IP2Num=0
sip=tsip
if isnumeric(left(sip,2)) then
str1=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str2=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str3=left(sip,instr(sip,"."Applause-1)
str4=mid(sip,instr(sip,"."Applause+1)
num=cint(str1)*256*256*256+cint(str2)*256*256+cint(str3)*256+cint(str4)
IP2Num = num
end if
end function

perkiset

NBs I guess I was a bit confused... I thought the test was more about the speed of the indexes than the code... in the example you give, the queries will be completely dependent on your function calls, which may be compiling into pretty expensive code. There's a lot of string code that's being called for every query, if I read that correctly.

Could you throw a test out that did something like this:

Pre-randomize all of the IPs you are going to look for ie., create an array of either 1M strings or 1M numerics. Then simply walk that array and look them up, doing no functional code inbetween each lookup - then we'll know the pure difference between a string and a numeric based index - which is what I thought the original notion was.

Then we'd know if it even makes sense to pursue better functions for converting that string into a number.

nutballs

the test i ran though is relevant to how you would actually use these lookups. I just ran in loops of 1000 so that I could get larger discrepancies to measure. This also ONLY tests lookups, which is what 90% of the usage of this would be for anyway. storage is a heavy hit no matter how you slice it, heavier even with indexes. But in the end the relevance is DB querytime plus manipulation time in code to prepare for the lookup. This is only for Classic

ASP

  mindo you. It might be different for other langs. Another test to try would be the same exact thing but using stored procs. so something like a straight query such as "select top 1 * from table where ipnum=sp_ip2num('207.123.234.12')" which would give a more pure result, regardless of the interface language.

the records in SQL are not indexed, though i was planning on running them indexed as well.
For the Padded method, i tried a couple different manipulations, and the one i ran with performed the best it seemed. thats the right("00"&cstr(ip1),3) part obviously. i tried for loops, i tried a string of IF statements, though i dont know if i tried a select/case statement... hmm. i'll try that too. oh, and i guess i do have 2 of them, not being delivered via function. i will functionify those as well and post updated results.

every test is exactly the same, except for the method of lookup
t=now    '===set a start time
for i = 0 to ubound(arr) '===loop the entire array of IPs that I have pre-created for testing.
sql="select top 1 * from ips where ipnum="&ip2num(arr(i)) '===build the SQL query
db.execute sql '===run it, but return nothing
next
response.write "ipnum: "&datediff("s",t,now)&"<br>" '===write out the seconds of change.


nutballs

ok, update. and this is now weird.

I have made it so each test is as close to the same as possible, with the only variance being the method of lookup. each is called from a function to make sure the function call wasnt adding extra baggage to the process. I also tested all 3 padding methods i could think of.

I even made the query only request the number 1, so that row length doesnt even figure in.

so here are the results. Averages of 5 runs with 50 lookup each.


                          average seconds      pct faster then next up
split octets:          28                          30%
string:                36.4                        4%
numeric:              38                          26%
padded Caseloop:  48                          1%
padded ifloop:      48.4                        7%
padded right:        51.8                        0%

Split 4 column octets works the fastest it seems.
but frankly im baffled. 2.4 million records, no indexing, and string based lookup is faster than numeric?!?!@??

functions for reference

function ipsplit(ip)
dim a,sql
a=split(ip,"."Applause
sql="select top 1 '1' from ips where ip1="&a(0)&" and ip2="&a(1)&" and ip3="&a(2)&" and ip4="&a(3)
ipsplit=sql
end function

function ipstring(ip)
dim sql
sql="select top 1 '1' from ips where ipstring='"&ip&"'"
ipstring=sql
end function

function ippaddedright(ip)
dim iparr,sql
iparr=split(ip,"."Applause
sql="select top 1 '1' from ips where ippadded="&Right("00" & iparr(0),3)&Right("00" & iparr(0),3)&Right("00" & iparr(0),3)&Right("00" & iparr(0),3)
ippaddedright=sql
end function

function ippaddedcase(ip)
dim iparr,sql,octet,pad
iparr=split(ip,"."Applause
pad=""
for each octet in iparr
select case len(octet)
case 1 pad=pad&"00"&octet
case 2 pad=pad&"0"&octet
case 3 pad=pad&octet
end select
next
sql="select top 1 '1' from ips where ippadded="&pad
ippaddedcase=sql
end function

function ippaddedif(ip)
dim iparr,sql,octet,pad
iparr=split(ip,"."Applause
pad=""
for each octet in iparr
if len(octet)=1 then
pad=pad&"00"&octet
elseif len(octet)=2 then
pad=pad&"0"&octet
else
pad=pad&octet
end if
next
sql="select top 1 '1' from ips where ippadded="&pad
ippaddedif=sql
end function

function IP2Num(ip)
dim str1,str2,str3,str4,sip,sql
sip=ip
str1=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str2=left(sip,instr(sip,"."Applause-1)
sip=mid(sip,instr(sip,"."Applause+1)
str3=left(sip,instr(sip,"."Applause-1)
str4=mid(sip,instr(sip,"."Applause+1)
sql = "select top 1 '1' from ips where ipnum="&str1*256*256*256+str2*256*256+str3*256+str4
ip2num=sql
end function

StephenBauer


Sweet dude.  I didn't really mean you had to do it!  Applause  But it is way cool that you did.  This will come in handy a bit down the road and save me some time too.  Applause has been credited to your PayPal account.  Applause

Did you try the math using the binary shift at all?  Applause

SB

nutballs

no i didnt try binary shifting since i couldnt figure out how to do it in

ASP

  (mostly cause i havent bit shifted since college), and didnt really spend much time on the whole thing. only took me about an hour in front of the TV, refining stuff down and such.

im gonna guess though that 4 columns is going to be the best method, in any system, since i have thought about it a bit more. Converting to bigint is going to be the same math in any language, and would be the same steps. Padding also, though maybe certain languages may have faster implementation of IF/Case/For but i doubt it. Where as the 4 column method is a fixed array resulting from a split of the IP, which all languages should do about the same efficiency. The possible values are limited to 255 per column, meaning that little gain should be achieved from indexing i think. but im gonna try it anyway and post those results.

esrun

Maybe consider test flat file and use is_file


Perkiset's Place Home   Politics @ Perkiset's