The Cache: Technology Expert's Forum
 
Author Topic: How to optimize script - PHP  (Read 1974 times)
krustee
Rookie
« on: March 21, 2010, 06:43:20 AM »

I have written a script that scrapes websites and counts the number of proxies on each page. The script is solely for narrowing down a list of URLs to add as targets for leeching later on.

The problem is that the script runs out of memory an hour or so after I start it. Here's the error I'm getting:
Code:
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 19923967 bytes) in /home/justin/proxyscraper/seeds/scraper.php on line 13

My longer-term goal has been to learn a language such as Python that could do this much faster (threads), but at the moment I'm just trying to get something workable with the coding knowledge I have. I think that if I just allocate more memory to the script the same thing will happen, since it barely made a dent in the huge list of sources that I have.

Now I'm not much of a coder, but I can usually hack something together so it works for me. If anyone has any ideas, I'm all ears.

Code:
<?php
// Read all the sources into an array and drop duplicates
$lines = file("proxysource.txt");
$lines = array_unique($lines);

$success = array();
$regex = '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}/';

// Scrape each page; keep any source with at least 10 proxies
foreach ($lines as $line) {
    $result = scrapepage($line);
    preg_match_all($regex, $result, $matches);
    if (count($matches[0]) >= 10) {
        $success[] = $line;
        echo $line . "\n";
    }
    // Free the page contents and matches before the next iteration
    unset($result, $matches);
}

// Save the successful sites to a file (open it once, not once per URL)
$fh = fopen("goodsource.txt", 'a') or die("can't open file");
foreach ($success as $url) {
    fwrite($fh, $url);
}
fclose($fh);
?>
« Last Edit: March 21, 2010, 09:24:12 AM by krustee »
kurdt
Lifer
« Reply #1 on: March 21, 2010, 10:03:46 AM »

These kinds of problems almost always arise from variables you never empty. On every pass through the loop they gather more data, until finally there's too much and the script dies. So check which variables you keep adding data to, and empty them whenever you can. I have had to stab many, many classes I found on the net because of this very problem.
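The failure mode described above can be sketched in a few lines of Python 3 (illustrative names only): an accumulator that is never emptied grows on every pass, while a variable that is rebound each iteration lets the old data be freed.

```python
# Sketch of the leak: all_matches grows on every pass and is never emptied,
# so memory use rises with the number of pages processed.
def leaky(pages):
    all_matches = []
    for page in pages:
        all_matches.extend(page.split())   # grows forever
    return all_matches

# Rebinding `matches` each pass means the previous list can be freed;
# only a small summary (the counts) is kept.
def frugal(pages):
    counts = []
    for page in pages:
        matches = page.split()             # old list released each iteration
        counts.append(len(matches))
    return counts
```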

I met god and he had nothing to say to me.
krustee
Rookie
« Reply #2 on: March 21, 2010, 09:58:39 PM »

In case anyone is interested, here is the same thing written in (very, very) bad Python (not threaded). It was a long day, but I got it done.

Code:
import re
import web  # the poster's page-fetching module

def unique(seq):
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

proxy_source = open("proxysource.txt", "r")
sources = proxy_source.readlines()
proxy_source.close()
sources = unique(sources)

pattern = re.compile(r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}')

for source in sources:
    try:
        data = web.grab(source)
    except Exception:  # catch fetch errors, but not via a bare except
        print "Timeout raised and caught"
        continue       # without this, the code below would reuse stale data
    if len(pattern.findall(data)) >= 10:
        FILE = open("goodsource.txt", "a")
        FILE.write(source)
        FILE.close()
        print 'good source'
    else:
        print 'bad source'
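As an aside, in modern Python 3 the hand-rolled `unique()` helper is unnecessary; a minimal sketch:

```python
# Sketch: de-duplicating a list of lines without a helper function.
lines = ["a\n", "b\n", "a\n", "c\n"]

# If order does not matter, a set drops duplicates directly.
assert set(lines) == {"a\n", "b\n", "c\n"}

# If order matters (Python 3.7+), dict keys preserve insertion order.
assert list(dict.fromkeys(lines)) == ["a\n", "b\n", "c\n"]
```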
« Last Edit: March 21, 2010, 10:02:19 PM by krustee »
Bompa
Administrator
Lifer
« Reply #3 on: March 22, 2010, 05:23:34 AM »

Quote from: kurdt on March 21, 2010, 10:03:46 AM
These kinds of problems always rise from variables you don't empty. So in every loop they gather more data and finally there's too much and it dies. So just check in which variables you add data constantly and empty them whenever you can. I have had to stab many, many classes I found from net because of this very problem.

Agreed, I do it all the time.

array_push($success,$lines[$i]);

Maybe empty that array?

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein
krustee
Rookie
« Reply #4 on: March 22, 2010, 06:59:18 AM »

Well, I believe it's mission accomplished. Below is the same script as above, but now written in Python and threaded. I have to say I'm quite proud of myself for 24 hours' work. Enjoy.

Code:
import threading
import web  # the poster's page-fetching module
import time
import Queue
import re

THREAD_NUMBER = 50

pattern = re.compile(r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}')
write_lock = threading.Lock()  # keep 50 threads from interleaving file writes

def unique(seq):
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

proxy_source = open("proxysource.txt", "r")
sources = proxy_source.readlines()
proxy_source.close()
sources = unique(sources)

def check_for_proxies(url):
    """
    Takes a url as a string and checks the page for proxies.
    Returns True if it finds at least 10.
    """
    try:
        data = web.grab(url)
    except Exception:
        print "Source timed out"
        return False
    if len(pattern.findall(data)) >= 10:
        print 'good source'
        with write_lock:
            FILE = open("goodsource_2.txt", "a")
            FILE.write(url)
            FILE.close()
        return True
    else:
        print 'bad source'
        return False

class sourcechecker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            source = self.queue.get()
            if check_for_proxies(source):
                print "[INFO] " + source + " contains proxies"
            self.queue.task_done()

def main():
    start = time.time()

    queue = Queue.Queue()

    for i in xrange(THREAD_NUMBER):
        pc = sourcechecker(queue)
        pc.setDaemon(True)  # daemon threads exit when main() finishes
        pc.start()

    print "[INFO] Will now check sources for proxies"

    for source in sources:
        queue.put(source)

    queue.join()
    print "Elapsed Time: %s" % (time.time() - start)

if __name__ == '__main__':
    main()
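In Python 3 the same worker-pool pattern is usually written with `concurrent.futures` rather than a hand-rolled `Thread` subclass plus a `Queue`; a sketch, with `check` standing in for the hypothetical `check_for_proxies`:

```python
# Sketch (Python 3): the worker-pool pattern via ThreadPoolExecutor.
# `check` is a stand-in for check_for_proxies: it takes a URL and
# returns True when the source is good.
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_sources(sources, check, workers=50):
    good = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit one task per URL; the pool limits concurrency to `workers`
        futures = {pool.submit(check, url): url for url in sources}
        for fut in as_completed(futures):
            if fut.result():
                good.append(futures[fut])
    return good
```

The executor replaces the manual daemon threads and `queue.join()`: leaving the `with` block waits for all tasks to finish.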