crypt

I've been considering using open proxies with certain parts of my code for a long time. However, I run into lots of timeouts and refused connections when testing my proxies in Firefox/Charon. I'm guessing the refused connections are caused by too many people using that particular proxy simultaneously, because they can be alleviated with a bit of hammering. Timeouts just happen sometimes, even with good proxies, and the same can be said for refused connections. Bleh, pain in the ass...

I guess I'm just overthinking it and need to start off by coding a proxy tester in PHP, so I can at least have an up-to-date database of proxies. Anonymity doesn't matter, but country of origin does, so I would use a GeoIP database with it. I guess what I really need is moral support: just a few suggestions as to how you would do it, or how you HAVE done it, in whatever language, would help me out a lot...

For example:
Where would you obtain proxies? (I currently scrape mine from proxy sites, but I do know that you can scan for them, but haven't ever had success with that)

How would you test in PHP/cURL? (I would assume setting a timeout - not sure which option, but there are a couple of timeout options - and the proxy with HTTP tunnel and follow redirect, then hitting my target site and checking for a footprint - we'll say google.com for scraping purposes, even though it's not.)

How do you handle timeouts/failed connections encountered once you have "good" proxies? (I assume setting a maximum number of retries, and just hammering it - failing that, mark the proxy as bad in the database)
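For what it's worth, a single test request along those lines might look roughly like this in PHP. This is only a sketch: the helper names (`proxy_test_options`, `has_footprint`, `test_proxy`) and the 10-second default are made up for illustration. The two timeout options are `CURLOPT_CONNECTTIMEOUT` (time allowed to connect) and `CURLOPT_TIMEOUT` (time allowed for the whole transfer).

```php
<?php
// Hypothetical helpers; only the curl_* functions and CURLOPT_* constants
// come from PHP's cURL extension.

/** Options for one test request through $proxy, given as "host:port". */
function proxy_test_options(string $proxy, string $url, int $timeout = 10): array
{
    return [
        CURLOPT_URL             => $url,
        CURLOPT_PROXY           => $proxy,
        CURLOPT_HTTPPROXYTUNNEL => true,      // tunnel through the proxy with CONNECT
        CURLOPT_FOLLOWLOCATION  => true,      // follow redirects
        CURLOPT_MAXREDIRS       => 5,
        CURLOPT_CONNECTTIMEOUT  => $timeout,  // seconds allowed to connect
        CURLOPT_TIMEOUT         => $timeout,  // seconds allowed for the whole transfer
        CURLOPT_RETURNTRANSFER  => true,      // return the body instead of printing it
    ];
}

/** True if the response body contains the expected footprint (a regex). */
function has_footprint(string $body, string $pattern): bool
{
    return preg_match($pattern, $body) === 1;
}

/** One attempt: fetch $url through $proxy and check status code and footprint. */
function test_proxy(string $proxy, string $url, string $pattern, int $timeout = 10): bool
{
    $ch = curl_init();
    curl_setopt_array($ch, proxy_test_options($proxy, $url, $timeout));
    $body = curl_exec($ch);                           // false on timeout or refusal
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return $body !== false && $code === 200 && has_footprint($body, $pattern);
}
```

Something like `test_proxy('203.0.113.5:8080', 'http://www.google.com/', '~<title>Google</title>~i')` would then give a plain pass/fail for one attempt (the address is a documentation placeholder, not a real proxy). Plenty of HTTP proxies refuse CONNECT to port 80, so the tunnel option may need to be switched off for plain HTTP targets.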


That's all I can think of right now, I'm sure I'll think of more after I've gotten to the point where I can actually USE the proxies I've scraped. Please hold my hand!

Thanks

perkiset

quote author=crypt link=topic=293.msg1973#msg1973 date=1181094531

Where would you obtain proxies? (I currently scrape mine from proxy sites, but I do know that you can scan for them, but haven't ever had success with that)

There was a list posted @ syndk8 a bit ago... perhaps a good starting place but I didn't test any of the addresses listed.

quote author=crypt link=topic=293.msg1973#msg1973 date=1181094531

How would you test in PHP/cURL? (I would assume setting a timeout - not sure which option, but there are a couple of timeout options - and the proxy with HTTP tunnel and follow redirect, then hitting my target site and checking for a footprint - we'll say google.com for scraping purposes, even though it's not.)

If you're just looking for timing, perhaps you should proxy right back to the machine that is testing so that the timestamp would be as accurate as possible - also, the URL you query could run a little PHP script letting your <waiting process> know that the request has come through. A simple GET param would let the script know which process to alert. Or it could be a semaphore file, or semV, or APC cache item, or record in the DB or or or... perhaps you hit it a few dozen times at various times to see what the average turnaround on requests is and post that, so you have an idea what to expect when you go outside your own network.

quote

How do you handle timeouts/failed connections encountered once you have "good" proxies? (I assume setting a maximum number of retries, and just hammering it - failing that, mark the proxy as bad in the database)

I'd score a proxy record in my database as +(an amount) for every good hit and build a running average of response times so that I could prefer faster proxies in the future - obviously it'd get dinged for a fail. I would not list it as "bad" until it had failed on more than one occasion - or perhaps failed several times - because, as you note, even the best will have problems and you should not just toss something without giving it a solid chance. Additionally, I might list a proxy as "suspended" for a couple of weeks or something, so that as every other spammer (ahem, anonymous user) discovers the proxy is overloaded and leaves, and the load falls away, you might be able to make use of it again. I might even have a spreading timeframe for when proxies are to be rechecked based on failure, i.e., if they fail for the first time, they'll only be un-preferenced a little... but if they fail several times I might schedule the next attempt on <that record> for a week from now - and if it still fails then, perhaps a month from then... perhaps after 6 months of repeated failure it'd for reals get the boot (provided I know that the proxy is still alive but I just can't get in).
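To make that a bit more concrete, here is a rough PHP sketch of the bookkeeping. Everything specific in it is an assumption for illustration: the record fields, treating "several" failures as three in a row, and the 7-day / 30-day / 180-day steps. It works on a plain array where a real setup would use a DB row.

```php
<?php
const DAY = 86400; // seconds

/** Hypothetical record layout for one proxy. */
function new_proxy(): array
{
    return ['score' => 0, 'hits' => 0, 'avg_ms' => 0.0, 'fails' => 0,
            'first_fail' => null, 'status' => 'active', 'next_check' => 0];
}

/** Good hit: bump the score and fold the response time into a running average. */
function record_success(array $p, float $ms, int $bonus = 1): array
{
    $p['hits']++;
    $p['avg_ms'] += ($ms - $p['avg_ms']) / $p['hits']; // incremental mean
    $p['score'] += $bonus;
    $p['fails'] = 0;             // the failure streak is over
    $p['first_fail'] = null;
    $p['status'] = 'active';
    return $p;
}

/** Failed hit: ding the score and spread out the next recheck. */
function record_failure(array $p, int $now): array
{
    $p['fails']++;
    $p['score']--;
    $p['first_fail'] ??= $now;   // remember when this streak started
    if ($now - $p['first_fail'] >= 180 * DAY) {
        $p['status'] = 'dead';   // roughly 6 months of repeated failure
    } elseif ($p['fails'] >= 3) {
        $p['status'] = 'suspended';
        // Third failure in a row: retry in a week; after that, in a month.
        $p['next_check'] = $now + ($p['fails'] === 3 ? 7 : 30) * DAY;
    }
    return $p;
}
```

Picking a proxy is then just a matter of sorting the active records by score (or by `avg_ms`), and the recheck job only touches suspended records whose `next_check` has passed.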

I dunno...
/p

crypt

Pretty much along the same lines as I'm thinking, and I do like the +/- scoring idea, I use that for a few other things. First things first, however, I'm gonna check that list at syndk8.

Thanks so far, not done here yet!

perkiset

Here's a small list of starters:
http://www.syndk8.net/forum/index.php/topic,9089.0.html

KMBA (MangoPirate) posted this and I think it might merit some of your time:
http://www.syndk8.net/forum/index.php/topic,9516.0.html

That might get you going...

/p

Bompa

Hey crypt,

I did what perk mentioned with the +/- scoring.

I scrape proxy4free, (all those places have the same proxies anyways, some of the free proxy
sites are owned by same person LOL; just marketing), twice a day with a cron so the output
gets emailed to me, (altho I haven't received it in a few days, wtf?).

I put them in a Perl DBM and have my script GET google with each proxy.
I check the page title for 'google':

if ($content =~ m{<title>Google</title>}i) {

If true, that proxy gets a +1, if false it gets -1.

If any proxy reaches -5 total, it is removed from the DB.
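For anyone following along in PHP rather than Perl, the same tally could be sketched like this. It is not Bompa's script: the function names are made up, and a plain in-memory array stands in for the DBM file.

```php
<?php
/** Same title check as the Perl line above, with ~ as the regex delimiter. */
function title_matches(string $content): bool
{
    return preg_match('~<title>Google</title>~i', $content) === 1;
}

/** +1 for a pass, -1 for a fail; a proxy that reaches the floor is removed. */
function tally(array $scores, string $proxy, bool $ok, int $floor = -5): array
{
    $scores[$proxy] = ($scores[$proxy] ?? 0) + ($ok ? 1 : -1);
    if ($scores[$proxy] <= $floor) {
        unset($scores[$proxy]);   // reached -5 total: drop it from the list
    }
    return $scores;
}
```

After a night of looping over the list, `arsort($scores)` puts the most reliable proxies at the top.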

This is not a fast way to test tho, I let the sucker run overnight.

After 8 hours, the list is quite a bit smaller, but many of the proxies will have
a nice high score like 20-50, while most have a score of less than 10.

Anyways, I think of it as a reliability test.  It shows me which proxies I have my
best odds with.

Of course, I am adding "new" proxies everyday.

Bomps

crypt

Right on guys. We're all more or less on the same page, I'll get to work on this and let you know if I come up with any cool trix to add to the system.

thedarkness

I'm interested in this crypt. I've been meaning to check out www.cspy.org, it appears to have some good info, but I'm not at that point in a couple of "master plans" yet. I would have a tendency to use something that gives you access to the timeout on the socket so you have greater control in your testing process. Something along the lines of perk's or Bomps' weighting system for preferred proxies is a cool idea.

Like I said I'm interested, post your progress here crypt and maybe when I'm at that stage I might be able to dive in and give a hand.

quote author=Bompa link=topic=293.msg1981#msg1981 date=1181135430

twice a day with a cron so the output
gets emailed to me, (altho I haven't received it in a few days, wtf?).


lmfao here Bomps

Cheers,
td

thedarkness

By a strange quirk of fate this thread is active at syndk8 now:

http://www.syndk8.net/forum/index.php/topic,11579.0.html

Anyone got any Russian contacts?

Cheers,
td

Bompa

TD, I understand you wanting the timeout on the socket, or whatever, but what I found
is that any sort of close tolerance testing of open proxies is useless.  I mean, one minute
a proxy works fine, next minute it's 500 or 400, try again and it works fine.  That's why I
found that a general overall reliability test is the best.  Just to find the proxies that I have
my best chances with at any given time.


Bompa

thedarkness

Understood Bomps, thanks for the heads up.

Cheers,
td

crypt

OK, I've pretty much finished the system. I scrape proxy4free and samair.ru; they both seem to give a good number of working proxies. We'll say the system is in beta, where it will probably stay indefinitely. What I do is scrape the sites, then I have a small checking script which I run as a background process like this:
exec('script.php arg1 arg2 > /dev/null 2>&1 &');
I pass it the number of tries (10), the URL to check against, etc... I add 1 to the score if it checks OK or subtract 1 if it fails, with a maximum of 10 and a minimum of 0. I run 50 of these processes from cron every minute in order to keep my checked list updated.

OK, now I've got an updated list. Next I need my curl calls to have a retry, so I have a function that will retry every connection until I get a 200, with a maximum number of attempts and a timeout (usually 10). If the maximum number of attempts is reached, then the next proxy in the list is tried. And it's pretty much that simple; the hard part was changing all my existing code to work like this.
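For readers who want to try something similar, here is a rough PHP sketch of the two pieces described above: the clamped score and the retry-then-next-proxy loop. It is not crypt's actual code. The function names, the default of 3 attempts, and reading the timeout as 10 seconds are all assumptions. The fetch step is passed in as a callable so the retry logic can be exercised without a network.

```php
<?php
/** Score moves by +1 or -1 per check but stays within [$min, $max]. */
function clamp_score(int $score, bool $ok, int $min = 0, int $max = 10): int
{
    return max($min, min($max, $score + ($ok ? 1 : -1)));
}

/**
 * Try each proxy up to $maxAttempts times until one returns HTTP 200.
 * $fetch takes a "host:port" proxy string and returns [status, body].
 * Returns null if every proxy in the list is exhausted.
 */
function fetch_with_fallback(array $proxies, callable $fetch, int $maxAttempts = 3): ?array
{
    foreach ($proxies as $proxy) {
        for ($try = 1; $try <= $maxAttempts; $try++) {
            [$status, $body] = $fetch($proxy);
            if ($status === 200) {
                return ['proxy' => $proxy, 'tries' => $try, 'body' => $body];
            }
        }
        // Attempts used up for this proxy: fall through to the next one.
    }
    return null;
}

/** A cURL-backed $fetch for $url (status is 0 on timeout or refusal). */
function curl_fetcher(string $url, int $timeout = 10): callable
{
    return function (string $proxy) use ($url, $timeout): array {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_PROXY          => $proxy,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        $body = curl_exec($ch);
        return [curl_getinfo($ch, CURLINFO_HTTP_CODE), $body === false ? '' : $body];
    };
}
```

Usage would be along the lines of `fetch_with_fallback($sortedProxies, curl_fetcher('http://example.com/'))`, with `$sortedProxies` ordered best score first.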

I also used a GeoIP database to sort my proxies by country. That's pretty simple to set up as well, and is very useful...

perkiset

Well done - sounds well thought out, and clearly makes best use of the OS to do the threading for you...

thedarkness

Very cool crypt.

BTW, there is a fork function for PHP just so everyone is aware.

Cheers,
td

perkiset

quote author=thedarkness link=topic=293.msg2187#msg2187 date=1181886036

BTW, there is a fork function for PHP just so everyone is aware.


Tis... however, caveat programmetor: you must only call it when executing PHP from a shell and never from an HTTP call
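The function in question is presumably `pcntl_fork()` from the PCNTL extension. A guard along these lines keeps it from ever being called under a web SAPI. This is a sketch, and the helper names `can_fork` and `run_in_children` are made up for illustration.

```php
<?php
/** Forking is only sane from the command line, and only if PCNTL is loaded. */
function can_fork(): bool
{
    return PHP_SAPI === 'cli' && function_exists('pcntl_fork');
}

/** Run $work($job) in one child process per job; returns how many were started. */
function run_in_children(array $jobs, callable $work): int
{
    if (!can_fork()) {
        throw new RuntimeException('pcntl_fork() needs the CLI SAPI and the PCNTL extension');
    }
    $pids = [];
    foreach ($jobs as $job) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {      // child: do the work, then leave
            $work($job);
            exit(0);
        }
        $pids[] = $pid;        // parent: remember the child
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);   // reap children so none linger as zombies
    }
    return count($pids);
}
```

Run from cron or a shell, this does the same job as launching 50 background scripts with exec(), but with a parent process that knows when every child has finished.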

/p

JasonD

Just a heads up to an old old old couple of apps I love and use extensively.


Yet Another Proxy Hunter - http://yaph.sourceforge.net/

and its sister application, which I think is one of the most beautiful things I have ever come across

ProxyChains - http://proxychains.sourceforge.net/

Write your code like normal and make sure that all your *ahem - dubious* traffic is on a separate subnet, and then make sure that the router routes through ProxyChains.

Also make sure you run YAPH through ProxyChains.

JasonD

Oh, BTW, I will need some beta testers soon for proxy related stuff.

Essentially they will need to both have traffic and use proxies.

'nuff said  and Mr. P please feel free to delete this if "out of order"

perkiset

Define "order" my friend.

No worries

/p

JasonD

Let me reword it.

"Please delete if it crosses any line you or others have set for this place"

