PHP to download a gz file

« on: October 25, 2008, 08:16:37 PM »

I am trying to automate something, but I'm getting stuck.

It is a text file on a remote server that is actually gzipped.

I can't figure out what the hell I am supposed to use to get the damn text out.

Specifically, this is Fantomaster's gzipped CSV of the spiderbase. I tried the XML but couldn't get it to play, so I am going to use the CSV, and I want the gz so the download goes a little faster.

I'm sure you already do this, perk ;)


« Reply #1 on: October 25, 2008, 10:49:46 PM »

If it's gzipped, then gunzip it on the command line at a Unix prompt.
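
For anyone who wants to stay inside PHP instead of shelling out, here is a minimal sketch, assuming the zlib extension is loaded (it provides gzdecode() and gzencode()). The helper name gunzipToLines and the example URL are made up for illustration; nothing here comes from Fantomaster's feed itself.

```php
<?php
// Hypothetical helper: decompress gzip data and return its non-empty lines.
function gunzipToLines(string $gzData): array
{
    // gzdecode() returns false when the input is not valid gzip data
    $text = @gzdecode($gzData);
    if ($text === false) {
        throw new RuntimeException('Data is not valid gzip');
    }
    // Split on any common line ending, then drop empty lines
    $lines = preg_split('/\r\n|\r|\n/', $text);
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}

// Usage sketch (placeholder URL, not the real feed):
// $lines = gunzipToLines(file_get_contents('http://example.com/spiders.csv.gz'));
```

This reads the whole file into memory before splitting it, which is fine for a small list but gives up the line-by-line processing described in the next paragraph.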

I pull it ungzipped from Fanto every morning; I don't worry about the download time. Here is the script I use for one of my servers (I pull it down a few different ways at different times) with just my personal details replaced. Note that I actually process each line into the DB during the download, rather than getting the file and then processing the whole thing. The only part that is not entirely obvious is the call to the stored procedure updateSpider, the script for which is posted after the PHP code. It should be reasonably obvious; let me know if I need to explain anything.


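#! /usr/local/bin/php
<?php

$fantomasterURL = '';

$dbHost = '';
$dbUser = 'username';
$dbPass = 'password';
$dbName = '';   // passed to dbConnection below; its assignment is missing from the post

// Characters stripped from every CSV field: double quotes and line breaks
$search = array('"', "\n", "\r");
$now = date('Y-m-d H:i:s', time());

// dbConnection is the poster's own wrapper class (not shown in the thread)
$db = new dbConnection($dbHost, $dbUser, $dbPass, $dbName);

if (($handle = fopen($fantomasterURL, 'r')) === FALSE) die('Cannot open Fantomaster');

$db->query("replace into shared.sysvars(name, value) values('spider_dlstatus', 'downloading')");

$total = 0;
$inserted = 0;
$blocked = 0;
while ($thisLine = fgets($handle))
{
        // Publish progress every 100 lines
        if (($total % 100) == 0)
                $db->query("replace into shared.sysvars(name, value) values('spider_dlstatus', '$total')");
        $total++;

        $parts = explode(',', $thisLine);
        // The original condition was lost from the post; requiring at least
        // four fields (so that $parts[3] exists) is an assumption
        if (count($parts) >= 4)
        {
                // mysql_escape_string() was removed in PHP 7; on newer versions
                // use the escaping or prepared statements of your DB layer
                $engine = mysql_escape_string(str_replace($search, '', $parts[0]));
                $useragent = mysql_escape_string(str_replace($search, '', $parts[1]));
                $address = trim(str_replace($search, '', $parts[3]));
                // Google entries are used even when commented out with '#'
                if (preg_match('/google/i', $engine)) $address = str_replace('#', '', $address);

                // The counter increments were lost from the post; their placement
                // is inferred from the variable names
                if ((substr($address, 0, 1) <> '#') && ($address > ' ') && preg_match('/^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/', $address))
                {
                        $db->query("call shared.updateSpider('$address', '$engine', '$useragent')");
                        $inserted++;
                } else {
                        $blocked++;
                }
        }
}
fclose($handle);

// Record the final status and counters
$now = date('Y-m-d H:i:s', time());
$db->query("replace into shared.sysvars(name, value) values('spider_lupdate', '$now')");
$db->query("replace into shared.sysvars(name, value) values('spider_inserted', '$inserted')");
$db->query("replace into shared.sysvars(name, value) values('spider_blocked', '$blocked')");
$db->query("replace into shared.sysvars(name, value) values('spider_dlstatus', 'idle')");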
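
The per-line parsing in the script above could also be pulled into a small function. This is only a sketch under assumptions: the helper name parseSpiderLine is hypothetical, the column positions (0, 1, 3) are taken from the script, and it swaps the explode()/regex approach for str_getcsv() and filter_var(), which also reject out-of-range octets such as 999.1.1.1.

```php
<?php
// Hypothetical helper: extract engine, user agent and IPv4 address from one CSV line.
// Returns null when the line has too few fields or the address is not a valid IPv4.
function parseSpiderLine(string $line): ?array
{
    // str_getcsv() handles the surrounding double quotes for us
    $parts = str_getcsv(trim($line), ',', '"', '\\');
    if (count($parts) < 4) {
        return null;
    }
    $address = trim((string) $parts[3]);
    // Commented-out entries such as "#1.2.3.4" fail validation and are skipped
    if (filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    return [
        'engine'    => (string) $parts[0],
        'useragent' => (string) $parts[1],
        'address'   => $address,
    ];
}
```

Unlike the script, this does not special-case Google entries that start with '#'; that rule would have to be added before the validation step.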


CREATE PROCEDURE updateSpider(addr char(16), eng varchar(255), ua varchar(255))
BEGIN
   declare ipNum integer unsigned;
   declare dummy integer;
   declare needRec integer;
   declare spiderID integer;
   declare engineID integer;
   declare uaID integer;

   -- 1062 = duplicate key: the engine or user agent already exists, so carry on
   declare continue handler for 1062 set dummy=1;
   -- no row found for this IP: a new spider record is needed
   declare continue handler for not found set needRec=1;
   set needRec = 0;
   set ipNum = inet_aton(addr);

   select ip into spiderID from shared.spiders where ip=ipNum;
   if (needRec = 1) then

      insert into shared.engines(caption) values(eng);
      select id into engineID from shared.engines where caption=eng;

      insert into shared.useragents(caption) values(ua);
      select id into uaID from shared.useragents where caption=ua;

      insert into shared.spiders(ip, address, engine_id, ua_id) values(ipNum, addr, engineID, uaID);

   end if;
END
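
The procedure stores each address as an unsigned integer via inet_aton(). If the same number is ever needed on the PHP side, a rough equivalent is ip2long(); the sketch below assumes IPv4 only, and the helper name ipToUnsigned is made up for illustration.

```php
<?php
// Hypothetical helper: PHP-side counterpart of MySQL's inet_aton() for IPv4.
// Returns the value as a decimal string, or null for an invalid address.
function ipToUnsigned(string $addr): ?string
{
    $n = ip2long($addr);
    if ($n === false) {
        return null;
    }
    // '%u' forces an unsigned representation, which matters on 32-bit builds
    // where ip2long() can return a negative integer
    return sprintf('%u', $n);
}
```

For example, ipToUnsigned('1.2.3.4') gives '16909060', the same value MySQL returns for inet_aton('1.2.3.4').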
« Last Edit: October 25, 2008, 10:52:41 PM by perkiset »


« Reply #2 on: October 26, 2008, 07:37:10 AM »

Thanks perk. Yeah, I'll just give up on the gz. I was only trying to reduce my footprint where I can, but it's no biggie. Thanks for the code.

Heh. I didn't know you could do this:
str_replace($search,' ',$content);
where $search is an array.
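
A small sketch of that behaviour: when the search argument is an array and the replacement is a single string, every element of the array is replaced by that same string. The function name stripNoise is hypothetical; the search array is the one from perk's script.

```php
<?php
// Hypothetical helper: remove double quotes and line breaks from a CSV field.
function stripNoise(string $field): string
{
    // Each element of the search array is replaced by the same (empty) string
    $search = array('"', "\n", "\r");
    return str_replace($search, '', $field);
}

// Usage sketch:
// stripNoise("\"Googlebot\"\r\n") returns 'Googlebot'
```

The replacement can also be an array, in which case each search element is paired with the replacement at the same index.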

Hey Fanto, if you read this:
A version of the botbase that I could use, and probably others could too, is just the IPs. Nothing else. I personally don't care which bot it is and such, at least in one of my apps. Just a thought.

