The Cache: Technology Expert's Forum
 
Author Topic: Problem trying to scrape a site  (Read 6276 times)
patch
« on: July 12, 2011, 11:42:30 AM »

Hi,

I'm trying - unsuccessfully - to scrape data from the racingpost.com site.

e.g. I'm trying to scrape (using curl) the race & runner info from a page like this:

http://www.racingpost.com/horses2/cards/card.sd?race_id=534879&r_date=2011-07-12

But I can't get any of the info back ... when I echo out the returned result I simply get the race 'info' at the top of the page and the BETTING FORECAST at the bottom of the page, all the runner info is missing.

Obviously I'm c**p at scraping but can anyone point out what I'm doing wrong?

Cheers.
« Last Edit: July 12, 2011, 12:45:58 PM by perkiset »
perkiset
« Reply #1 on: July 12, 2011, 12:48:50 PM »

Well I was originally (and quickly) going to say that perhaps it's an AJAX site and the racer data is coming down secondarily to the primary page, but that's not the case. When I simply view the source of that page, everything is there, meaning that scraping should be a pretty simple exercise.

How are you gathering the page, and how are you endeavoring to scrape/parse it?

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
patch
« Reply #2 on: July 12, 2011, 01:34:17 PM »

Hi,

I'm using a simple script ... which works just fine when I scrape, for example, bbc.co.uk so I'm wondering if it's something about the racing post site or something I'm not doing.

Here's the simple code:

Code:
include('LIB_http.php');
$ref="www.racingpost.com";
$target = "http://www.racingpost.com/horses2/cards/card.sd?race_id=534879&r_date=2011-07-12";
$response = http_get($target, $ref);
$data=$response['FILE'];
echo $data;

the LIB_http.php library contains the http_get function which looks like this:

Code:
function http_get($target, $ref)
    {
    return http($target, $ref, $method="GET", $data_array="", EXCL_HEAD);
    }

the http function looks like this:

Code:
function http($target, $ref, $method, $data_array, $incl_head)
{
    # HEAD, GET, POST, COOKIE_FILE, CURL_TIMEOUT and WEBBOT_NAME are constants,
    # presumably defined elsewhere in LIB_http.php

    # Initialize PHP/CURL handle
    $ch = curl_init();

    # Process data, if presented
    if (is_array($data_array))
    {
        # Convert data array into a query string (ie animal=dog&sport=baseball)
        $temp_string = array();
        foreach ($data_array as $key => $value)
        {
            if (strlen(trim($value)) > 0)
                $temp_string[] = $key . "=" . urlencode($value);
            else
                $temp_string[] = $key;
        }
        $query_string = join('&', $temp_string);
    }

    # HEAD method configuration
    if ($method == HEAD)
    {
        curl_setopt($ch, CURLOPT_HEADER, TRUE);    // Include http head
        curl_setopt($ch, CURLOPT_NOBODY, TRUE);    // Do not return body
    }
    else
    {
        # GET method configuration
        if ($method == GET)
        {
            if (isset($query_string))
                $target = $target . "?" . $query_string;
            curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
            curl_setopt($ch, CURLOPT_POST, FALSE);
        }
        # POST method configuration
        if ($method == POST)
        {
            if (isset($query_string))
                curl_setopt($ch, CURLOPT_POSTFIELDS, $query_string);
            curl_setopt($ch, CURLOPT_POST, TRUE);
            curl_setopt($ch, CURLOPT_HTTPGET, FALSE);
        }
        curl_setopt($ch, CURLOPT_HEADER, $incl_head);   // Include head as needed
        curl_setopt($ch, CURLOPT_NOBODY, FALSE);        // Return body
    }

    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE_FILE);   // Cookie management
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE_FILE);
    curl_setopt($ch, CURLOPT_TIMEOUT, CURL_TIMEOUT);    // Timeout
    curl_setopt($ch, CURLOPT_USERAGENT, WEBBOT_NAME);   // Webbot name
    curl_setopt($ch, CURLOPT_URL, $target);             // Target site
    curl_setopt($ch, CURLOPT_REFERER, $ref);            // Referer value
    curl_setopt($ch, CURLOPT_VERBOSE, FALSE);           // Minimize logs
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);    // Do not verify the SSL certificate
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);     // Follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);             // Limit redirections to four
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);     // Return in string

    # Create return array
    $return_array['FILE']   = curl_exec($ch);
    $return_array['STATUS'] = curl_getinfo($ch);
    $return_array['ERROR']  = curl_error($ch);

    # Close PHP/CURL handle
    curl_close($ch);

    # Return results
    return $return_array;
}
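
One quick way to tell a failed or short fetch from a display problem is to look at the STATUS and ERROR entries that http() already returns, before echoing anything into the browser. A minimal sketch, assuming the return array has the shape shown above; describe_response() is a hypothetical helper and not part of LIB_http.php:

```php
<?php
// Hypothetical helper (not part of LIB_http.php): summarize the array
// returned by http() so the fetch can be checked without rendering the HTML.
function describe_response(array $response)
{
    // curl_exec() returns FALSE on failure, so guard before measuring length.
    $body = is_string($response['FILE']) ? $response['FILE'] : '';

    // curl_getinfo() called without an option returns an array that
    // includes an 'http_code' entry.
    $code = isset($response['STATUS']['http_code']) ? $response['STATUS']['http_code'] : 0;

    // curl_error() returns an empty string when nothing went wrong.
    $error = $response['ERROR'] !== '' ? $response['ERROR'] : 'none';

    return sprintf('HTTP %d, %d bytes, error: %s', $code, strlen($body), $error);
}
```

Calling something like echo describe_response(http_get($target, $ref)); should show at a glance whether the whole page came back, regardless of how the browser renders it.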

Thanks for looking.
Bompa
« Reply #3 on: July 12, 2011, 05:12:28 PM »

Hi patch,

I'm into perl, not php, but I just got and echo'd that whole page with this code on my PC running WAMP.

Code:
<?php
$url = "http://www.racingpost.com/horses2/cards/card.sd?race_id=534879&r_date=2011-07-12";
$str = file_get_contents($url);
echo $str;
?>

Some say that won't work on all hosts, I dunno. I had to look up the code cuz I never use php.

http://www.howtogeek.com/howto/programming/php-get-the-contents-of-a-web-page-rss-feed-or-xml-file-into-a-string-variable/
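
The "won't work on all hosts" caveat is most likely about PHP's allow_url_fopen setting: when a host switches it off, file_get_contents() refuses http:// URLs. A rough sketch of a fetch that falls back to cURL in that case; fetch_page() is a made-up name and the 30-second timeout is an arbitrary choice:

```php
<?php
// Hypothetical helper: fetch a URL with file_get_contents() when the host
// allows URL wrappers, otherwise fall back to the cURL extension.
function fetch_page($url)
{
    // allow_url_fopen must be enabled for file_get_contents() to open http:// URLs.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return file_get_contents($url);
    }

    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // arbitrary timeout in seconds
        return curl_exec($ch);                          // FALSE on failure
    }

    return false; // neither method is available on this host
}
```

Both branches return FALSE on failure, so the caller only needs one check before parsing the result.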


Good luck,
Bompa

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein
perkiset
« Reply #4 on: July 12, 2011, 06:52:20 PM »

Interesting note - when I used the original URL, it popped to a cards page, rather than what I had asked for. But then I jumped down into one of the cards with this:
Code:
<?php
$buff = file_get_contents('http://www.racingpost.com/horses2/cards/card.sd?race_id=534907&r_date=2011-07-13');
echo $buff;
?>

... like Bomps mentions and I got all the data. So I'm not sure what the trouble is.
Phaėton
« Reply #5 on: July 12, 2011, 09:07:32 PM »


Quote
when I echo out the returned result ...

Perhaps you're displaying characters that are getting marked up?

try

echo '<textarea rows=20 cols=80>'.$buff.'</textarea>';

?
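
The textarea trick works because the browser shows the element's contents as text instead of rendering them. It can still break if the scraped page itself contains a </textarea> tag, so escaping the markup first is a bit safer. A small sketch using PHP's htmlspecialchars(); show_source_safely() is a hypothetical helper name, and the sample string just stands in for the scraped $buff:

```php
<?php
// Hypothetical helper: wrap scraped HTML so the browser displays the raw
// source instead of interpreting it.
function show_source_safely($html)
{
    // Escape <, >, & and quotes so every tag (including a stray </textarea>)
    // is shown as text. ENT_SUBSTITUTE keeps the output from coming back
    // empty when the page is not valid UTF-8.
    $escaped = htmlspecialchars($html, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return '<textarea rows="20" cols="80">' . $escaped . '</textarea>';
}

// Stand-in for the scraped page; in the thread this would be $buff.
echo show_source_safely('<b>Runner 1</b></textarea>'), "\n";
```

If the runner rows show up inside the box, the fetch was fine and the problem was only in how the raw echo got rendered.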

When I was your age we used to walk to the TV to change the channel....  _̴ı̴̴̡̡̡ ̡͌l̡̡̡ ̡͌l̡*̡̡ ̴̡ı̴̴̡ ̡̡͡|̲̲̲͡͡͡ ̲▫̲͡ ̲̲̲͡͡π̲̲͡͡ ̲̲͡▫̲̲͡͡ ̲|̡̡̡ ̡ ̴̡ı̴̡̡
patch
« Reply #6 on: July 13, 2011, 12:14:22 AM »

Quote
... like Bomps mentions and I got all the data. So I'm not sure what the trouble is.

The trouble is that I'm particularly crap at scraping and I couldn't see the wood for the trees.

Quote
echo '<textarea rows=20 cols=80>'.$buff.'</textarea>';

That shows me that I am, in fact, getting all of the data ...

... thanks for helping guys, really appreciate it.
Bompa
« Reply #7 on: July 13, 2011, 08:24:54 AM »


Quote
perhaps you're displaying characters that are getting marked up?
try
echo '<textarea rows=20 cols=80>'.$buff.'</textarea>';


Oh, so putting it in a textarea makes it plain text?  I never thought of doing that.

cool dude
mashal
« Reply #8 on: June 18, 2012, 04:18:29 AM »

I tried using <textarea> but had no success. I'm also scraping a website, with this code:

Code:
$target = 'http://www.cleartrip.com/flights/results?from=CCU&to=DEL&depart_date=22/06/2012&adults=1&childs=0&infants=0&dep_time=0&class=Economy&airline=&carrier=&x=57&y=16&flexi_search=yes&tb=n';
$data = file_get_contents($target);
echo $data;
perkiset
« Reply #9 on: June 18, 2012, 11:30:40 PM »

I just commented in the thread you started on the same issue.