The Cache: Technology Expert's Forum
 
Author Topic: ajax and curl optimized  (Read 5837 times)
serialnoob
Journeyman
« on: December 05, 2009, 06:02:26 PM »

I am into the automation part of my setup, using a combination of ajax or refreshes and curl, etc.

I have been testing something like 30 different combinations for different purposes, so it becomes difficult to properly evaluate performance.

Most processes collect data to be stored in a db for further processing (kwd matching, posting, linkbuilding, etc.), using a mix of tor and multiple data centers mostly.

Ultimately I hope to put most of these scripts on a separate box running full time (which I guess most of you do). Hence the performance issue, as in how to scale up to a sustainable limit (a 10-to-1000 kind of situation!).

Of course other parameters come into the equation, but limiting things to curl, perk's webrequest, or HttpRequest, my understanding is that when using tor, for instance, I sort of "delegate processes" to multiple IPs (for lack of a better explanation).

While this is going on (slowly but surely), not much CPU is used, therefore additional scripts can be launched, right?

What I cannot figure out is how to prevent or limit "collisions" between scripts, which I assume are bound to happen.

Any tips or general conclusions on the subject?





Success consists of going from failure to failure without loss of enthusiasm - Winston Churchill
oldenstylehats
Rookie
« Reply #1 on: December 05, 2009, 06:21:14 PM »

You might want to look into a message/job queuing daemon like Beanstalkd.

perkiset
Olde World Hacker
Administrator
Lifer
« Reply #2 on: December 05, 2009, 06:58:38 PM »

Use a database for a queue, and a single script for dispatching jobs. If only one entity is pulling jobs from the queue and then dispatching "worklets" with a single, specific and already defined ID, you'll never have collision or multiple-worker problems.

Either that, or have lots of threads working and write a stored procedure that locks the table, extracts a single ID, updates that row so that it won't get picked again, then unlocks the table. If all done in a stored procedure it'll go blazing fast.

Since it is a DB, it can be queried from any number of worker boxes as well.
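
For illustration only, a minimal sketch of that claim-one-job idea in PHP/MySQL, using an atomic UPDATE rather than an explicit table lock; the jobs table and its columns are hypothetical, not anyone's actual schema:

<?php
// Hypothetical jobs table: id INT PK, payload TEXT, status TINYINT,
// claimed_by VARCHAR(40). status: 0 = waiting, 1 = claimed, 2 = done.
$pdo   = new PDO('mysql:host=localhost;dbname=queue', 'user', 'pass');
$token = uniqid('worker_', true);

// Claim exactly one waiting row in a single statement. The statement is
// atomic, so two workers can never end up owning the same row.
$pdo->prepare("UPDATE jobs SET status = 1, claimed_by = ?
                WHERE status = 0 LIMIT 1")
    ->execute([$token]);

// Fetch whatever we just claimed (zero or one row).
$stmt = $pdo->prepare("SELECT id, payload FROM jobs WHERE claimed_by = ? AND status = 1");
$stmt->execute([$token]);

if ($job = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // ... hand $job off to the "worklet", then mark it done ...
    $pdo->prepare("UPDATE jobs SET status = 2 WHERE id = ?")->execute([$job['id']]);
}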

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
serialnoob
Journeyman
« Reply #3 on: December 05, 2009, 09:14:22 PM »

Thanks both, I suffer from the "biz turned tech" (or trying) syndrome, so the "scenario" can be difficult to describe at times!
So I will try to describe the particular situation that led me to this post:

Since the initial post, I have had nearly 50,000 links checked as "postable" or not (for linking), via curl/tor, out of a 150,000-row MySQL table, and updated accordingly. Postable implies some regex, so while I am there I also check keywords, vicinity, page rank, etc. But the subqueries don't seem to be a burden.

It's slow but stable. However, the script sends 50 to 75 links to a curl multi thread and comes back typically 60 to 180 s later. During this time the CPU remains below 10%.

My concern is that I cannot predict exactly when it comes back, although the solution seems rather elegant and discreet if I understand it fully, as it splits the threads amongst the tor IPs (10 ATM), as opposed to one per IP with a single curl, let alone the time it would take to connect to that single IP.

Perk, it is interesting, however, that you straight away suggest it would be a db issue, undoubtedly from experience, but this 150,000-row table creates a peak of at most 30%. Are you saying that, in a more intensive setup, this is precisely where things would go bad?

(had not noticed) Bruce is the man! 
« Last Edit: December 05, 2009, 09:22:20 PM by serialnoob »

perkiset
Olde World Hacker
Administrator
Lifer
« Reply #4 on: December 05, 2009, 10:03:27 PM »

150K rows is trivial. 30% utilization means that your indexing scheme is faulty. Remember that even in a straight-up B-Tree index (assuming pages of 32 rows), with a million-row table it only takes 16 disk-head movements to find any record. It only takes 3 more for 8 million.

I go to a database because, beyond a couple hundred rows that could be loaded into RAM, they are *the most efficient* way of handling/manipulating/searching/working with large chunks of data. Far and away. Put simply, if your DB solution on 50K records is slower than a text-file-based solution, it's got some logical hiccups.

I also don't use multithread in situations like that because (IMO) I'm not as good as the OS at distributing processing need. So my resolution is a single job: be asleep as much as you can while doing one single job then die. This allows the processor of a machine to really manipulate who gets time - and will give your DB enough processor to do its job as well. Even on hugely spiky times when I have lots and lots of jobs humming simultaneously, the problem is saturation of my pipe, not processor.

serialnoob
Journeyman
« Reply #5 on: December 05, 2009, 11:23:05 PM »

That is where I don't think binary!

db: atm it is:
 PRIMARY on md5(uri), and domain is a numeric id on a domain table

If I get this right, I can generally get more efficiency with a structure like this instead:
 PRIMARY on a numeric id, then adding a $db->singleAnswer type of query to provide the same uri uniqueness.


multithread: (the logic again)
 so 10 db queries -> 10 curl makes better use of the processor than 1 db query -> 10x10 curl, or is at least an improvement?



perkiset
Olde World Hacker
Administrator
Lifer
« Reply #6 on: December 06, 2009, 12:01:51 PM »

Quote from: serialnoob
db: atm it is:
 PRIMARY on md5(uri), and domain is a numeric id on a domain table
This doesn't tell us much, unless you can specify how you are looking the data up. If you have 10 indices on a database, but look something up in a way that doesn't make use of them, then you are in the same position as if you had none. But I will say that if you include the md5() function in your query, then you will definitely be adding overhead, because you have to run every row through that function. Hot tip: static, fixed-size columns and particularly integers make the best fodder for indices.
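
As a small illustrative sketch of the fixed-width point (table and column names are made up): store md5(uri) once in an indexed CHAR(32) column and hash only the PHP-side constant, so the lookup is a plain indexed equality test rather than a function applied per row.

<?php
// Hypothetical links table with an indexed uri_hash CHAR(32) column.
$pdo = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
$uri = 'http://example.com/some/page';

// md5() runs once, in PHP, on the constant; MySQL can then use the index.
$stmt = $pdo->prepare("SELECT id, status FROM links WHERE uri_hash = ?");
$stmt->execute([md5($uri)]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);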


Quote from: serialnoob
If I get this right, I can generally get more efficiency with a structure like this instead:
 PRIMARY on a numeric id, then adding a $db->singleAnswer type of query to provide the same uri uniqueness.
I'm sorry but I don't follow you here...


Quote from: serialnoob
multithread: (the logic again)
 so 10 db queries -> 10 curl makes better use of the processor than 1 db query -> 10x10 curl, or is at least an improvement?
If you regularly have 10 queries, then you should be using $db->multiQuery() rather than query ... or perhaps a stored procedure that gets 10 rows of exactly what you want and returns them as a view.

walrus
Rookie
« Reply #7 on: December 06, 2009, 05:28:13 PM »

I'm using something like this to prevent high loads since I have a lot of scripts that aren't really optimized:

In the crontab I have a "cron_master.php" file that is running every minute. I'm not running any other php scripts directly from cron.

It has a list of php files it needs to run at every "cycle"; it looks like this:

add_script('crawler.php',10);
add_script('get_more_keywords.php',2);
add_script('do_something_evil.php',10);

The add_script function (script name, number of times to run) checks whether the script named in the first parameter is already running and whether it already has enough processes in memory (I simply parse the ps output). In the background, it launches however many processes of that particular script are needed to reach the count in the second parameter.

BUT..

Before actually executing anything, it checks the system load. If it's higher than some value you choose, it sleeps a bit and checks again, until the load is low enough. This prevents too many scripts from running and helps if you're on low-end hardware. You could also limit the number of network connections fairly easily, by counting the netstat output rows.

Also, there can be only one instance of the "cron_master" script running, so the system is never overloaded.
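
A rough sketch of that cron_master idea (the helper names are illustrative and the single-instance guard is left out):

<?php
// cron_master.php - run from cron every minute; tops workers up to a
// target count per script, but only while the system load is acceptable.

function running_count($script) {
    // Count already-running copies of the script by parsing ps output.
    exec('ps ax | grep ' . escapeshellarg($script) . ' | grep -v grep', $lines);
    return count($lines);
}

function wait_for_low_load($max = 2.0) {
    // Sleep until the 1-minute load average drops below the threshold.
    while (true) {
        $load = sys_getloadavg();   // [1 min, 5 min, 15 min]
        if ($load[0] < $max) {
            return;
        }
        sleep(5);
    }
}

function add_script($script, $target) {
    for ($i = running_count($script); $i < $target; $i++) {
        wait_for_low_load();
        // Launch in the background so cron_master itself isn't blocked.
        exec('php ' . escapeshellarg($script) . ' > /dev/null 2>&1 &');
    }
}

add_script('crawler.php', 10);
add_script('get_more_keywords.php', 2);
add_script('do_something_evil.php', 10);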

What you're describing with the 30% load is way too high for that application. Are you using "order by rand()"? If you are, DON'T!
« Last Edit: December 06, 2009, 05:33:47 PM by walrus »
serialnoob
Journeyman
« Reply #8 on: December 07, 2009, 10:36:29 AM »

Perk: Thanks for your tolerance with my previous post!
I hope this will be better

Here is the logic

a- Scrape links and sentences using vs blog search
b- Scan links to see if a comment is possible (no captcha etc)
c- If so, leave a test comment to check if it is kept and come back later
d- Sort targets as auto or manual depending on above results 

Sentences are tagged and db stored for article voodoo


So, aside from purely kwd and tags, I have:

table 1
primary alpha domain
index numeric id
index time (entry date)

table 2
primary md5(uri)
index alpha uri
index numeric id (domain)
index alpha kwd
index numeric status (0 not tested, then incremental)
index time lastquery (last shot)


Blog search comes back, I clean and regex out as many tags and sentences as I can, and populate tables 1 and 2:
insert ignore domain table 1
insert path or update status table 2
select 0 status uri limit 100
curl / tor from c-  above and start again
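
Purely as an illustration of that flow, with guessed table and column names (domains, links, uri_hash, status, lastquery):

<?php
// Guessed schema: domains(id, domain), links(uri_hash CHAR(32) PK, uri,
// domain_id, status, lastquery). status 0 = not tested yet.
$pdo      = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
$domain   = 'example.com';                      // example values only
$domainId = 1;
$uri      = 'http://example.com/blog/post-1';

// Table 1: a domain goes in once; duplicates are silently ignored.
$pdo->prepare("INSERT IGNORE INTO domains (domain) VALUES (?)")
    ->execute([$domain]);

// Table 2: insert the path, or just touch lastquery if the uri is known.
$pdo->prepare("INSERT INTO links (uri_hash, uri, domain_id, status, lastquery)
               VALUES (?, ?, ?, 0, NOW())
               ON DUPLICATE KEY UPDATE lastquery = NOW()")
    ->execute([md5($uri), $uri, $domainId]);

// Pull the next batch of untested uris for the curl/tor pass.
$batch = $pdo->query("SELECT uri FROM links WHERE status = 0 LIMIT 100")
             ->fetchAll(PDO::FETCH_COLUMN);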

It could well be that the burden is actually in the tag/sentence analysis, as in effect what comes back from curl is quite a few KB.
Stored procedures, yes, but I am a slow learner!


walrus: I am a total noob with commands so I'll have to check further on "ps", as it could be my missing link for another script as well, thanks!

walrus
Rookie
« Reply #9 on: December 07, 2009, 11:07:19 AM »

Add another table "blacklist".

When you've downloaded more than X pages from a domain and none of them are useful, or it's something like 1 in 20 useful pages per domain, add that domain to the blacklist, remove it from the other tables, and never use it again.
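
A sketch of that idea, with made-up table and column names and example thresholds ("useful" is assumed here to mean status = 2):

<?php
// Assumed: links(domain_id, status) where status 2 means "postable",
// plus blacklist(domain_id) and domains(id) tables.
$pdo = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');

// Domains with 20+ checked pages and fewer than 1 in 20 useful ones.
$bad = $pdo->query(
    "SELECT domain_id FROM links
      WHERE status > 0
      GROUP BY domain_id
     HAVING COUNT(*) >= 20 AND SUM(status = 2) / COUNT(*) < 0.05"
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($bad as $id) {
    $pdo->prepare("INSERT IGNORE INTO blacklist (domain_id) VALUES (?)")->execute([$id]);
    $pdo->prepare("DELETE FROM links WHERE domain_id = ?")->execute([$id]);
    $pdo->prepare("DELETE FROM domains WHERE id = ?")->execute([$id]);
}
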
serialnoob
Journeyman
« Reply #10 on: December 07, 2009, 11:20:17 AM »

Quote from: walrus
Add another table "blacklist".

When you've downloaded more than X pages from a domain and none of them are useful, or it's something like 1 in 20 useful pages per domain, add that domain to the blacklist, remove it from the other tables, and never use it again.

I do it by updating the domain status with -1.
But your point might illustrate my lack of SQL experience, whereby more tables does not necessarily mean less efficiency.

perkiset
Olde World Hacker
Administrator
Lifer
« Reply #11 on: December 10, 2009, 04:39:41 PM »

Tag/sentence analysis is not DB intensive though, is it? I thought the trouble was an overloaded database. There is clearly something not right, because if your database usage is as high as you describe, then either your scraper is BLAZINGLY fast (so fast, in fact, that it's more efficient than your DB) or you've got the database involved where it shouldn't be, or you've got queries that are not optimized.

Look up EXPLAIN to analyze each of your SQL queries first off. Then write some logging code to see which part of your code takes the longest, and home in from there.
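
Something this small is enough for the logging part; the EXPLAIN side is just a matter of prefixing the query with EXPLAIN in the MySQL client. The labels below are arbitrary:

<?php
// Time each suspect block separately and log the result.
$start = microtime(true);
// ... the big SELECT, or the regex/tag pass, goes here ...
error_log(sprintf('select pass took %.1f ms', (microtime(true) - $start) * 1000));

$start = microtime(true);
// ... the curl/tor batch goes here ...
error_log(sprintf('curl pass took %.1f ms', (microtime(true) - $start) * 1000));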

tomblack
Rookie
« Reply #12 on: December 11, 2009, 01:30:45 AM »

Quote from: perkiset
Then write some logging code to see which part of your code takes the longest, and home in from there.

The stopwatch class at PHP Classes is easy to use and I found it useful for tracking badly performing code.  Wink

http://www.phpclasses.org/browse/package/2061.html
serialnoob
Journeyman
« Reply #13 on: December 12, 2009, 09:20:19 AM »

Thank you all, gone testing and will come back with results.

serialnoob
Journeyman
« Reply #14 on: March 05, 2010, 10:07:52 AM »

The single line that changes everything (in curl_multi):

before

$running = null;
do {
    // No pause at all: curl_multi_exec gets hammered in a tight loop.
    curl_multi_exec($curlHandle, $running);
} while ($running > 0);

cpu 100% peaks




after

$running = null;
do {
    curl_multi_exec($curlHandle, $running);
    usleep(100); // yield for 100 microseconds so the CPU isn't pegged
} while ($running > 0);

cpu flat!



It is those 100 microseconds that make the difference, to the extent that one might wonder why it is not built into curl, with a default of, say, 100 microseconds, in the first place.

I have actually tried only 1 microsecond, and yes, it starts to show a minor bump in the CPU curve, but it is barely visible!
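
Along the same lines, a variant worth noting: curl_multi_select() lets curl itself block until one of the handles has activity, instead of sleeping a fixed amount (same $curlHandle as in the snippet above):

$running = null;
do {
    curl_multi_exec($curlHandle, $running);
    // Wait (up to 1 second) for activity on any handle; on a -1 return,
    // fall back to a short usleep so the loop doesn't spin.
    if (curl_multi_select($curlHandle, 1.0) === -1) {
        usleep(100);
    }
} while ($running > 0);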

Mind you, thanks to all this, I finally made my move to *nix! An entire new life has started.
« Last Edit: March 05, 2010, 11:53:27 AM by serialnoob »
