The Cache: Technology Expert's Forum
Author Topic: Something like array_diff - but faster  (Read 10364 times)
walrus
Rookie
Posts: 46
« on: May 29, 2009, 05:28:05 AM »

I have ~1,000,000 URLs in an SQL table; I'll call this the URL blacklist (and it's always growing).

I also have a few thousand URLs (<10,000) that I have to check against this blacklist at regular intervals, keeping just the ones that aren't already in the blacklist.

Besides the array_diff PHP function, I've tried some basic SQL queries to do this, but it all takes too long and uses too much CPU, IMO.

I'm sure there's a smarter way to do this using fewer resources, but I'm not good at more complex SQL queries.
isthisthingon
Global Moderator
Lifer
Posts: 2879
« Reply #1 on: May 29, 2009, 09:31:25 AM »

walrus,

There are lots of things you can do here, and I don't know how your environment is configured, hardware, etc., but here are a few initial thoughts:

  • Make sure your SQL environment has all the RAM it needs (MySQL, I'm assuming) by configuring it for your instance.
  • Since this large list is always used and growing, consider loading the table or the entire database into its own RAM drive.
  • Make sure it's indexed "properly." This is a common area where performance can be improved easily. Is this a clustered or heap table? If it's clustered (meaning one column is physically sorted in ascending or descending order), you might want to look at which column is being sorted. A heap table involves an additional "hop" to see the data the index is pointing to, but this is not a deal breaker for performance. A clustered index really picks up speed when the clustered column has repeated non-unique values. Imagine a list of names with 100 rows where the first 50 = "Adam". If record 51 = "Brian", the index quickly finds it. Therefore, if you had a clustered index that was physically ordered by domain, for example, and the rest of the URL lived in another column, you might see some performance gains.
  • Assuming the column is indexed, I'd recommend doing a count(*) since you probably don't want to retrieve anything from the database but just get an answer: does this URL exist? yes/no

Try the following pseudo-SQL and see if it helps (again, assuming the column "URL" is indexed):

SELECT COUNT(URL) FROM BlackListedURLs WHERE URL = localURLStringToValidate LIMIT 1

Again, this way you're returning no data.  I forget if the LIMIT keyword is allowed in standard SQL.  I use it all the time in Apex since once the limit is reached (one in your case) the query stops processing immediately.  Even without the LIMIT keyword you should have no problems getting an instant response from a million rows, especially if they're in memory and indexed properly.
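
From PHP it might look something like this (just a sketch: I'm assuming MySQL, a table actually named BlackListedURLs with an indexed URL column, and a mysqli connection in $mysqli; adjust the names to your schema):

Code:
<?php

// Sketch only: check one URL against the blacklist table.
// Assumes $mysqli is an open mysqli connection and that the URL column
// is indexed, e.g.  CREATE INDEX idx_url ON BlackListedURLs (URL);

function isBlacklisted(mysqli $mysqli, $url)
{
    $stmt = $mysqli->prepare(
        "SELECT COUNT(URL) FROM BlackListedURLs WHERE URL = ? LIMIT 1");
    $stmt->bind_param('s', $url);
    $stmt->execute();
    $stmt->bind_result($count);
    $stmt->fetch();
    $stmt->close();
    return $count > 0;
}

?>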

Good luck!

I would love to change the world, but they won't give me the source code.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096
« Reply #2 on: May 29, 2009, 09:56:09 AM »

Perhaps I'm missing it, but I think I hear you saying that you want to ditch things that ARE in the blacklist, no?

If this is the case, a simple update or delete will do the job for you, except for one teensy problem: AFAIK, you cannot do a delete or update during a joined query in MySQL. Therefore, you'll need to do it as a stored procedure.

The process is to open a cursor that contains the join, then walk the cursor and update the target table with a flag that you'll use either to note that the URL is no longer valid or that it should be deleted. This little job, as an SP, will be ridiculously fast and efficient.

I believe this would do the trick:

Code:
-- Flags every URL in targettable that also appears in the blacklist table
DELIMITER //

CREATE PROCEDURE testupdater()
BEGIN

declare keepGoing boolean;
declare thisURL varchar(64);
declare master cursor for select t.url from bhtable bh, targettable t where t.url = bh.url;
declare continue handler for not found set keepGoing = false;

set keepGoing = true;
open master;
fetch master into thisURL;
while keepGoing do
    update targettable set active = false where url = thisURL;
    fetch master into thisURL;
end while;
close master;

END//

DELIMITER ;

This would be easiest to do in phpMyIDE, but it can also be done from the SQL command line.
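
If you end up calling it from PHP instead, a rough sketch might look like this (assuming a mysqli connection, and that targettable rows start out with active = true):

Code:
<?php

// Rough sketch, assuming $mysqli is an open mysqli connection and that
// targettable rows start out with active = true. Run the procedure,
// then read back the URLs that were NOT flagged as blacklisted.

$mysqli->query("CALL testupdater()");

$keepers = array();
$result = $mysqli->query("SELECT url FROM targettable WHERE active = true");
while ($row = $result->fetch_assoc())
    $keepers[] = $row['url'];
$result->free();

?>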

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
isthisthingon
Global Moderator
Lifer
Posts: 2879
« Reply #3 on: May 29, 2009, 10:44:56 AM »

For Friday fun...

In Apex the code would look like the following:

Code:
public String[] BlackListCompare(String[] smallURLList)
{
    // Pull only the blacklist rows that match the incoming URLs
    BlackListedURLs__c[] blURLs = [SELECT Id, URL__c FROM BlackListedURLs__c WHERE URL__c IN :smallURLList];

    // Collect the matches into a Set for quick lookups
    Set<String> blacklisted = new Set<String>();
    for(BlackListedURLs__c bl : blURLs)
    {
        blacklisted.add(bl.URL__c);
    }

    // Keep only the URLs that are NOT in the blacklist
    String[] keepers = new String[]{};
    for(String u : smallURLList)
    {
        if(!blacklisted.contains(u))
            keepers.add(u);
    }
    return keepers;
}

BlackListedURLs__c is the table with millions of URLs (the __c is just an Apex-ism for custom objects) and smallURLList is the in-memory array with under 10,000 items. The IN clause is optimized and recommended in Apex (Force.com) but a little slower in other SQL environments. The idea is the same, though - you only want to keep a list of items that do NOT exist in the master URL table, if I understand correctly. The first loop above is lean since it only iterates URLs that exist in both places - getting leaner with more frequent calls. For example, with only one db connection open, calling this function twice would result in that loop not even executing, since:

Code:
BlackListedURLs__c[] blURLs = [SELECT Id, URL__c FROM BlackListedURLs__c WHERE URL__c IN :smallURLList];

returns 0 rows on the second pass.

I would love to change the world, but they won't give me the source code.
walrus
Rookie
Posts: 46
« Reply #4 on: May 29, 2009, 01:51:37 PM »

Wow, thank you for your suggestions, isthisthingon and perkiset. I am looking at stored procedures and/or keeping the big list in RAM.

I'm afraid I was a bit unclear in my original post - but it's not a lot different from what I really wanted; my English is not always as clear as I would like it to be.

What I actually meant was:

At regular intervals I get a few thousand URLs and I need to check them against the blacklist (which has ~1,000,000 URLs).

Of those few thousand, I keep only the ones that aren't in the blacklist. <<- this is the part where I need some help
The blacklist does not get modified here.

I then process those remaining URLs (from the few thousand); some of them will end up in the blacklist, some will not.

Processing the URLs involves downloading and/or crawling some of them and deciding whether they will be used for something or end up in the blacklist. So the bigger the blacklist, the better, because processing useless URLs takes time.

I should mention that the MySQL server and the PHP script are on different machines, and I think it's best to do this on the SQL server.
isthisthingon
Global Moderator
Lifer
Posts: 2879
« Reply #5 on: May 29, 2009, 02:19:31 PM »

Then aside from general performance advice, perk's direction is the way to go. I wasn't sure if you had a list of a few thousand sitting in RAM at the client tier or living in some table in SQL. I think perk's stored procedure is almost a "copy-paste and you're done" operation. If you could let us know how it went, that would be cool. I always love problems involving data size, location, and performance requirements.

Quote
Is it beer-thirty yet?

I would love to change the world, but they won't give me the source code.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096
« Reply #6 on: May 29, 2009, 06:08:13 PM »

Quote
At regular intervals I get a few thousand URLs and I need to check them against the blacklist (which has ~1,000,000 URLs).

Of those few thousand, I keep only the ones that aren't in the blacklist. <<- this is the part where I need some help
The blacklist does not get modified here.

This should be a straight up PHP thang.

I am beered up pretty good with family ATM, so I can't write it for you. Frankly, I'm happy I can respond at all :) but perhaps tomorrow I can take a shot at this, because it's about 10 lines of PHP to do it right and it'll be quicker'n poop through a goose.

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096
« Reply #7 on: June 14, 2009, 02:39:54 PM »

OK, back in black.

Immediately, I think I'd do this as a joint StoredProc and PHP thing. I'd create a little stored procedure to test and keep (or dispose of) a URL, and a PHP script to wrap all of my current requests into a single query.

So: my stored procedure would look something like this:

Code:
DELIMITER //

CREATE PROCEDURE testURL(theURL varchar(128))
BEGIN

declare tempInt integer;

-- keep the URL only if it is not already in the blacklist
select count('x') into tempInt from blacklist where url = theURL;
if tempInt = 0 then
    insert into keepers(url) values(theURL);
end if;

END//

DELIMITER ;

... then in PHP I'd write a little script that would create one big huge string that looked like this:

Code:
<?php

// assume that $inURLs holds the URLs that I am testing...
$todo = array();
foreach ($inURLs as $url)
    $todo[] = "call testURL('$url');";

$todoStr = implode(chr(10), $todo);

// Assume that I've created an instance of my dbClass called $db:
$db->multiQuery($todoStr);

?>


IMO this takes advantage of PHP's best features for quickness, while letting MySQL do what it does best. On reflection, I can't imagine many ways of going much faster (all-in, of course - if the URLs are already in a DB you can make it go quicker, but I'm imagining the process from soup to nuts).
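
One small practical note: the stored procedure above assumes a keepers table to insert into, and the dbClass wrapper needs to allow multiple statements per call. Without the wrapper, the setup and read-back might look something like this (a sketch only; the table and column names are just the ones assumed above):

Code:
<?php

// Sketch only: create the keepers table the procedure inserts into,
// then read back the URLs that survived the blacklist check.
// Assumes $mysqli is an open mysqli connection.

$mysqli->query("CREATE TABLE IF NOT EXISTS keepers (url VARCHAR(128) NOT NULL)");

// ... run the batch of CALL testURL(...) statements here ...

$survivors = array();
$result = $mysqli->query("SELECT url FROM keepers");
while ($row = $result->fetch_assoc())
    $survivors[] = $row['url'];
$result->free();

?>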

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
netmktg
Rookie
Posts: 37
« Reply #8 on: June 26, 2009, 09:48:45 AM »

Quote
I have ~1,000,000 URLs in an SQL table; I'll call this the URL blacklist (and it's always growing).

I also have a few thousand URLs (<10,000) that I have to check against this blacklist at regular intervals, keeping just the ones that aren't already in the blacklist.

I do the same thing for blog URLs and ran into the same issues... I initially tried an SQL query with NOT IN, and that quickly hung up the MySQL service.

Looping over each record would be tediously slow (as in really, really slow).

I currently do an array_diff for smaller datasets. For larger datasets, I've enclosed the query in an outer while loop which limits records based on the ID (e.g. ID > XXXX and ID < XXX), and the actual SQL query has a LIMIT 10000 (or whatever). This also has the advantage that the SQL dataset requires less memory.

So I basically still do array_diff, but it's on much smaller datasets. For example, in your case of 1 million URLs, my approach would run through the entire lot in 100 SQL queries.
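
In rough PHP it would be something like this (a sketch only; I'm assuming an auto-increment id on the blacklist table, a mysqli connection in $mysqli, and the incoming list in $urls):

Code:
<?php

// Sketch of the chunked approach: walk the blacklist table in ID ranges
// and array_diff the small URL list against each chunk.
// Assumes $mysqli is an open mysqli connection, a blacklist table with an
// auto-increment id column, and $urls holding the few thousand incoming URLs.

$chunk  = 10000;
$maxRow = $mysqli->query("SELECT MAX(id) AS m FROM blacklist")->fetch_assoc();
$maxId  = (int)$maxRow['m'];

for ($start = 0; $start <= $maxId; $start += $chunk) {
    $end    = $start + $chunk;
    $result = $mysqli->query(
        "SELECT url FROM blacklist WHERE id > $start AND id <= $end LIMIT $chunk");

    $blacklistChunk = array();
    while ($row = $result->fetch_assoc())
        $blacklistChunk[] = $row['url'];
    $result->free();

    // keep only the URLs not seen in this chunk
    $urls = array_diff($urls, $blacklistChunk);
}

// $urls now holds only the URLs that are not in the blacklist
?>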


Another approach I've been using for a long time is to blacklist domains or hosts instead of URLs. I don't know whether it applies in your case, but if it does, it really cuts a lot of processing time. When doing the blacklisting process, I store the new URLs in an associative array as Url => Domain and then do an array_diff against the Blacklist_Domains_Array.
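
The domain variant is only a few lines of PHP (again a sketch; $urls is the incoming list and $blacklistDomains is assumed to hold the blacklisted hosts):

Code:
<?php

// Sketch of the domain-level check: map each incoming URL to its host,
// then drop any URL whose host appears in the domain blacklist.
// Assumes $urls holds the incoming URLs and $blacklistDomains is an array
// of blacklisted hostnames loaded from wherever you keep them.

$urlToDomain = array();
foreach ($urls as $url)
    $urlToDomain[$url] = parse_url($url, PHP_URL_HOST);

// array_diff compares the values (domains) and keeps the keys (URLs)
$keep = array_keys(array_diff($urlToDomain, $blacklistDomains));

?>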
lonnyklot
n00b
Posts: 1
« Reply #9 on: February 15, 2011, 07:46:56 PM »

I have done some research on the PHP array_diff function lately. It turns out there was a bug that was causing a severe slowdown. I would recommend upgrading to the latest version of PHP and trying array_diff again. You can read more about what was broken here.
perkiset
Olde World Hacker
Administrator
Lifer
Posts: 10096
« Reply #10 on: February 16, 2011, 08:59:13 AM »

Welcome, LonnyKlot, thanks for the post.

What version, specifically, did you see this issue with array_diff in? And when you say "the most current," are you talking simply about the major version or the minor number?

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.