The Cache: Technology Expert's Forum
 
Topic: How to scrape some text? (Read 4882 times)
NYDAz
Expert
****
Offline Offline

Posts: 212

The Night Stalker


View Profile
« on: June 19, 2009, 09:19:26 AM »

The main idea is this: I want to scrape some text about a product and put it on my page.

I don't want to copy/paste every product description by hand.

Any ideas on how to start?

 Shocked
Logged

what's up?
deregular
Expert
****
Offline Offline

Posts: 172


View Profile
« Reply #1 on: June 19, 2009, 09:33:54 AM »

$text = file_get_contents('http://www.youdomain.com');
and then grab what you need with preg_match().

Or you could use curl.

Or use the DOM to navigate around.
(vsloathe will have more info on this)

There are several ways.
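
A minimal sketch of the first option (fetch, then preg_match), assuming PHP 7.1 or later. The helper name extractDescription and the "description" class are made up for illustration, and the markup is a literal string so the snippet runs offline; in real use it would come from file_get_contents() or curl.

```php
<?php
// Hypothetical helper: returns the text of the first <div class="description">,
// or null when the pattern does not match. The class name is an assumption.
function extractDescription(string $html): ?string
{
    // s = dot also matches newlines, i = case-insensitive, .*? = non-greedy
    if (preg_match('~<div[^>]*class="description"[^>]*>(.*?)</div>~si', $html, $m)) {
        // Drop any inline tags left inside the captured fragment.
        return trim(strip_tags($m[1]));
    }
    return null;
}

// Stand-in for: $html = file_get_contents('http://www.youdomain.com');
$html = '<html><body><div class="description">Great <b>product</b>.</div></body></html>';
echo extractDescription($html), "\n";
```

A regex like this stops at the first closing div, so it only suits simple, flat markup; nested divs are where the DOM route tends to be the safer choice.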

What are you trying to scrape? A webpage, or some text from a file?
Logged
vsloathe
vim ftw!
Global Moderator
Lifer
*****
Offline Offline

Posts: 1669



View Profile
« Reply #2 on: June 19, 2009, 09:40:50 AM »

Show me a sample of what you want to scrape.
Logged

hai
NYDAz
« Reply #3 on: June 19, 2009, 09:51:37 AM »

h++p://www.generic-pharmacy-us.net/generic-levitra.html [edit, perk: let's not push any juice that way, shall we?]

this text

Quote
Generic Levitra is an oral drug used to treat male erectile dysfunction (ED) also referred to as impotence. Generic Levitra shouldn't be taken more then once a day. Taken 60 minutes before intercourse Generic Levitra remains active between 4 -20 hours. The success rate of oral ED drugs is very high, above 90%; however different people require different dosages to attain optimum results. Generic Levitra has the same active ingredient as brand name Levitra, and is equivalent in effect, strength, and dosage.

 Nerd
« Last Edit: June 19, 2009, 07:06:27 PM by perkiset » Logged

vsloathe
« Reply #4 on: June 19, 2009, 10:07:18 AM »

Not so sure I want to visit that link from my desk at work...

 ROFLMAO
Logged

NYDAz
« Reply #5 on: June 19, 2009, 10:11:07 AM »

Quote
Not so sure I want to visit that link from my desk at work...

 ROFLMAO
Grin

Another example then :

h++p://www.amazon.com/Millionaire-Next-Door-Thomas-Stanley/dp/0671015206 [edit, perk: or that one. Amazon's got enough, thank you.]

this text

Quote
From Library Journal
In The Millionaire Next Door, read by Cotter Smith, Stanley (Marketing to the Affluent) and Danko (marketing, SUNY at Albany) summarize findings from their research into the key characteristics that explain how the elite club of millionaires have become "wealthy." Focusing on those with a net worth of at least $1 million, their surprising results reveal fundamental qualities of this group that are diametrically opposed to today's earn-and-consume culture, including living below their means, allocating funds efficiently in ways that build wealth, ignoring conspicuous consumption, being proficient in targeting marketing opportunities, and choosing the "right" occupation. It's evident that anyone can accumulate wealth, if they are disciplined enough, determined to persevere, and have the merest of luck. In The Millionaire Mind, an excellent follow-up to the highly successful first analysis of how ordinary folks can accumulate wealth, Stanley interviews many more participants in a much more comprehensive study of the characteristics of those in this economic situation. The author structures these deeper details into categories that include the key success factors that define this group, the relationship of education to their success, their approach to balancing risk, how they located themselves in their work, their choice of spouse, how they live their daily lives, and the significant differences in the truth about this group vs. the misplaced image of high spenders. Narrator Smith's solid, dead-on reading never fails to heighten the importance of these principles that most twentysomethings should be forced to listen to in toto. Highly recommended for all public libraries. Dale Farris, Groves, TX
« Last Edit: June 19, 2009, 07:07:07 PM by perkiset » Logged

vsloathe
« Reply #6 on: June 19, 2009, 10:31:05 AM »

That content is on the page in a div called "content". I would use the page's DOM to get it.

Code:
<?php

$htmlContent = file_get_contents('http://www.amazon.com/Millionaire-Next-Door-Thomas-Stanley/dp/0671015206');
$DD = new DOMDocument();
@$DD->loadHTML($htmlContent);
@$nodes = $DD->getElementsByTagName('div');
foreach ($nodes as $node) {
   if ($node->getAttribute('class') == 'content') {
      $someContent = $node->getAttribute('innerHTML');
   }
}
?>


I think that will work. Not sure, I just typed it in this window and I don't really have the time just now to test it.
Logged

NYDAz
« Reply #7 on: June 19, 2009, 10:39:46 AM »

let's see !

thanks for your time vs Wink

Later edit: is it possible to strip links from the text? Or certain words?
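
For stripping links and particular words, here is a rough sketch, assuming PHP 7.1 or later. The helper name cleanText and the word list are hypothetical, not something from the scraped pages.

```php
<?php
// Hypothetical helper: strips markup (including <a> tags) and removes unwanted words.
function cleanText(string $html, array $bannedWords = []): string
{
    // strip_tags drops every tag but keeps the text between them,
    // so the link text survives while the <a href="..."> wrapper does not.
    $text = strip_tags($html);

    foreach ($bannedWords as $word) {
        // \b limits the match to whole words; preg_quote escapes regex metacharacters.
        $text = preg_replace('~\b' . preg_quote($word, '~') . '\b~i', '', $text);
    }

    // Collapse the whitespace left behind by the removals.
    return trim(preg_replace('~\s+~', ' ', $text));
}

echo cleanText('Buy <a href="http://example.com">cheap</a> books today', ['cheap']), "\n";
```

To drop the link text as well as the tag, one option is to run a preg_replace over the whole `<a ...>...</a>` element before calling strip_tags.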
« Last Edit: June 19, 2009, 10:44:02 AM by NYDAz » Logged

NYDAz
« Reply #8 on: June 19, 2009, 10:52:42 AM »

Tried to echo $someContent, but I just get a blank page.

 Embarrassed
Logged

vsloathe
« Reply #9 on: June 19, 2009, 11:14:11 AM »

I think innerHTML is the wrong attribute...

but I can't remember the right one now... textContent maybe?
Logged

vsloathe
« Reply #10 on: June 19, 2009, 11:14:43 AM »

oh, it might just be value.

hmmm
Logged

NYDAz
« Reply #11 on: June 19, 2009, 01:25:41 PM »

still searching  ROFLMAO
Logged

ehlo
Journeyman
***
Offline Offline

Posts: 50


View Profile
« Reply #12 on: June 19, 2009, 03:59:30 PM »

Try

Code:
$someContent = $node->nodeValue;
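
For context, a self-contained sketch of the whole loop with nodeValue in place, assuming PHP 7.1 or later. The helper name textOfDivsByClass and the sample markup are invented for illustration; the HTML is a literal string so it runs without fetching anything.

```php
<?php
// Hypothetical helper: collects the text of every <div> whose class attribute is exactly $class.
function textOfDivsByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('div') as $node) {
        if ($node->getAttribute('class') === $class) {
            // nodeValue is a property of the node, not an HTML attribute:
            // it holds the text of the node and its descendants, without tags.
            $found[] = trim($node->nodeValue);
        }
    }
    return $found;
}

$html = '<html><body><div class="content">Hello <a href="#">world</a></div>'
      . '<div class="other">skip</div></body></html>';
print_r(textOfDivsByClass($html, 'content'));
```

getAttribute() only reads attributes written in the markup (class, id, href and so on), which is why asking it for innerHTML comes back empty. If the markup is wanted rather than the plain text, $doc->saveHTML($node) returns the serialized node on PHP 5.3.6 and later.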
Logged
NYDAz
« Reply #13 on: June 19, 2009, 05:16:18 PM »

Quote
Catchable fatal error: Object of class DOMDocument could not be converted to string in /home/oleero/public_html/x.php on line 12

<edit>
time to sleep now ! see ya tomorrow !
</edit>
« Last Edit: June 19, 2009, 06:11:05 PM by NYDAz » Logged

perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #14 on: June 19, 2009, 07:09:15 PM »

Quote
That content is on the page in a div called "content". I would use the page's DOM to get it.

Code:
<?php

$htmlContent = file_get_contents('http://www.amazon.com/Millionaire-Next-Door-Thomas-Stanley/dp/0671015206');
$DD = new DOMDocument();
@$DD->loadHTML($htmlContent);
@$nodes = $DD->getElementsByTagName('div');
foreach ($nodes as $node) {
   if ($node->getAttribute('class') == 'content') {
      $someContent = $node->getAttribute('innerHTML');
   }
}
?>


I think that will work. Not sure, I just typed it in this window and I don't really have the time just now to test it.

I didn't look at the HTML, but if you say it's in a div called content, then wouldn't you be grabbing getElementById('content') rather than by class?
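
Since nobody has checked whether "content" is an id or a class on that page, here is a hedged sketch that can try either one through DOMXPath, assuming PHP 7.1 or later. The helper name firstText and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first node matching an XPath query, or null.
function firstText(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    // Keep libxml quiet about sloppy markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<html><body><div id="content">by id</div>'
      . '<div class="content">by class</div></body></html>';
echo firstText($html, '//div[@id="content"]'), "\n";    // matches on the id attribute
echo firstText($html, '//div[@class="content"]'), "\n"; // matches on the class attribute
```

The @class test here is an exact string match, so a div with several classes (class="content main") would need something like contains(@class, "content") instead.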
Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.