The Cache: Technology Expert's Forum
 
Author Topic: formatting scraped data  (Read 3877 times)
timjohn
« on: July 18, 2007, 09:41:51 PM »

hello all,

i am having a bit of a hang up on a couple of points here. i have built a scraper that scrapes google answers for questions in 120 different niches. i am scraping the first 100 results in google and writing them to a db. a couple of questions:

1.) i am scraping the title, category, question, and answer. i am running strip_tags() and mysql_real_escape_string() on each string as i assign it to an array. however, how should i go about getting the text formatted properly? currently, when displayed, it doesn't break at the correct spots and instead shows a literal "\r" or "\n". when i don't use the mysql escape function it still doesn't display properly; in that case each string is sans any carriage return or newline characters or html-encoded "&...." characters. how can i retain this formatting when writing to a mysql db? (note: i am going to be pulling these and posting to similar q and a sites)

2.) i am getting a sql error on the very first title it tries to insert... i believe this is caused by a "?" character. what is most likely causing this and how do i avoid it? i thought the previously mentioned escaping function would handle it... i have used my mysql wrapper functions in many other instances and they have not caused me any problems.

3.) as far as removing the same entries i have recorded twice, what is the best / most efficient way to go about this? is it best to check to see if the same question has already been scraped before i write to the db, or after the fact?  with 120 niches, an average of 5 keywords per niche, that is 60k questions to be scraped. obv i think i am going to have to break this up into a handful of runs so that it doesn't appear i am a bot to google (i am using curl wrapped in a simple class). i am also considering using a yahoo API for this, but they don't have every google answers page indexed.

any ideas would be appreciated. i have been digging on php.net and i know the info is there somewhere. i have tried many different solutions for each of these dilemmas with no luck.

thanks in advance,

tj
georgiecasey
« Reply #1 on: August 01, 2007, 09:01:06 PM »

1)
So you get a scraped title like this: How to find porn\n on the net. What I'd do is str_replace all \n and \r with a <br> and insert that into the database. Also, I think it's addslashes() you need to escape everything for mysql.

2)
Again, addslashes() would escape ? for mysql. Try that function instead of mysql_real_escape_string.

3)
The way I'd do it (which is the lazy and inefficient way) is to just place a unique index on the title field in the mysql table so you get no dupes.
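For point 1, here is a minimal sketch of one way the newline handling could go. It differs from the suggestion above in one respect: instead of storing a <br> in the database, it stores a normalized "\n" and only converts to <br /> at display time with nl2br(). The helper names normalizeNewlines() and renderForHtml() are made up for illustration. The last line also checks what addslashes() does with a "?": it only escapes quotes, backslashes and NUL bytes, so a lone "?" comes through unchanged either way.

```php
<?php
// Hypothetical helper: collapse \r\n and bare \r into \n so stored text is consistent.
function normalizeNewlines($text)
{
    return str_replace(array("\r\n", "\r"), "\n", $text);
}

// Hypothetical helper: escape HTML special characters, then put a <br /> before each newline.
function renderForHtml($text)
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

$scraped = "How do I format this?\r\nSecond line & more";
$stored  = normalizeNewlines($scraped);   // this is what would be escaped and written to the db
echo renderForHtml($stored), "\n";

// addslashes() leaves "?" alone, so the comparison below is true.
var_dump(addslashes("What is PHP?") === "What is PHP?");
```

Keeping plain newlines in the table means the same row can be reposted to sites that want <br />, <p> or raw text, since the markup is only decided on the way out.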

perkiset
« Reply #2 on: August 01, 2007, 10:15:39 PM »

@ #3: You'll need some kind of hashing algorithm to do it right. Perhaps something that walks the blob and takes every 20th char and puts it into a string as a key for the table... for example: Consider this email I scoffed:

I am glad to hear your trip went well. Ours
was also good, just far more walking and hiking
than I had planned, but I did come home to find
I had lost two pounds with no dieting. I had better
start hiking some mountains!

Let me know when would be best with your schedule
to meet since I do want to continue but just to map out
what we need to do over the next few months so I
think I would want to meet with Kelly also.

Call me soon to let me know your schedule,


... If I take every 20th char, I get something like this:
IuOj de is (you get the idea)

This key string is then placed into a field with a unique index... the processing to get that string is really easy... here's an example:
Code:
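function getCharHash($inBuff)
{
    // Lower-case and strip spaces so trivial differences don't change the key
    $inBuff = str_replace(' ', '', strtolower($inBuff));
    $out = array();
    $max = strlen($inBuff);
    // Sample every 20th character of the normalized buffer
    for ($i = 0; $i < $max; $i += 20)
        $out[] = $inBuff[$i];
    return implode('', $out);
}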
This string will be about 1/20th of the length of your total buffer (after the spaces come out) and will give you a pretty good look at your duplicates. Note that in my little function I excluded spaces and lower-cased everything to be a bit more persnickety.
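A rough sketch of how a key like this could be used to drop dupes before anything is written to the db, which was the other half of question 3. Two things here are assumptions rather than anything from the posts above: the key is md5() over the whole normalized text instead of every 20th char (a sampled key can come out the same for two different texts), and a plain PHP array stands in for the unique index. getDedupeKey() and dropDuplicates() are hypothetical names.

```php
<?php
// Hypothetical helper: normalize like getCharHash() does (lower-case, no spaces),
// then hash the whole text instead of sampling every 20th character.
function getDedupeKey($text)
{
    return md5(str_replace(' ', '', strtolower($text)));
}

// Hypothetical helper: keep only the first question seen for each key.
function dropDuplicates(array $questions)
{
    $seen = array();
    $unique = array();
    foreach ($questions as $q) {
        $key = getDedupeKey($q);
        if (!isset($seen[$key])) {
            $seen[$key] = true;   // remember the key so later copies are skipped
            $unique[] = $q;
        }
    }
    return $unique;
}

print_r(dropDuplicates(array('How do I scrape?', 'how do i  scrape?', 'What is PHP?')));
```

The md5 key is always 32 characters, so it fits a fixed-width column with a unique index if the check is moved into MySQL later.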

Just an idea. Good luck,
/p

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
timjohn
« Reply #3 on: August 20, 2007, 09:59:47 AM »

wow thanks perk and georgie. i had forgotten about this thread, really appreciate a quality response!

i haven't spent much time here, but it appears as if the forum is becoming more active, def a plus! i look forward to learning and contributing more.

timjohn
perkiset
« Reply #4 on: August 20, 2007, 10:03:10 AM »

No worries TJ -

In fact, I must have just clicked "Mark all as read" or something because I can't believe that I didn't respond earlier to your post.

Glad to be of help, hope you'll stick around more.

/p
Bompa
« Reply #5 on: August 20, 2007, 07:42:04 PM »

I'm just seeing this thread now.

Anyways, Dirk posted a way to accept only ASCII characters by their hex codes. It's easier than trying to eliminate the others. Not sure if it fits your situation tho.

It's in perl, but you get the idea:

$string =~ s/[^\x00-\x7E]//sg;   # keep only ASCII 0 - 126 (\x00 - \x7E)
http://www.perkiset.org/forum/perl_coding_best_practices/filtering_bad_characters-t372.0.html;msg2457#msg2457
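Since the rest of the thread is PHP, here is a sketch of roughly the same filter using preg_replace(). It uses the same character class as the Perl line, so it keeps bytes \x00 through \x7E (0 to 126) and drops everything else. stripNonAscii() is a made-up name, and whether throwing away non-ASCII bytes is acceptable for the scraped text is an assumption that depends on the data.

```php
<?php
// Hypothetical helper: remove every byte outside \x00-\x7E,
// the same character class as the Perl one-liner above.
function stripNonAscii($text)
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_replace('/[^\x00-\x7E]/', '', $text);
}

// The UTF-8 bytes for the accented "e" and the curly quotes are removed.
echo stripNonAscii("caf\xC3\xA9 \xE2\x80\x9Cquoted\xE2\x80\x9D"), "\n";
```

preg_replace() replaces every match by default, so there is no equivalent of the Perl g flag to add.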



Bompa
« Last Edit: August 20, 2007, 07:47:53 PM by Bompa »

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein