formatting scraped data

timjohn

hello all,

i am having a bit of a hang up on a couple of points here. i have built a scraper that scrapes google answers for questions in 120 different niches. i am scraping the first 100 results in google and writing them to a db. a couple of questions:

1.) i am scraping the title, category, question, and answer. i am doing strip_tags() and mysql_real_escape_string() on each string, as i assign it to an array. however, how should i go about getting the text formatted properly? currently, when displayed, it doesn't break at the correct spots and instead has "\r\n" or "\n". when i don't use the mysql escape function it still doesn't display properly; however, each string is sans any carriage return or newline characters or html encoded "&...." characters. how can i retain this formatting when writing to a mysql db? (note: i am going to be pulling these and posting to similar q and a sites)

2.) i am getting a sql error on the very first title it tries to insert... i believe this is caused by a "?" character. what is most likely causing this and how do i avoid it? i thought it was the previously mentioned escaping function... i have used my mysql wrapper functions in many other instances and they have not caused me any problems.

3.) as far as removing the same entries i have recorded twice, what is the best / most efficient way to go about this? is it best to check to see if the same question has already been scraped before i write to the db, or after the fact? with 120 niches, an average of 5 keywords per niche, that is 60k questions to be scraped. obv i think i am going to have to break this up into a handful of runs so that it doesn't appear i am a bot to google (i am using curl wrapped in a simple class). i am also considering using a yahoo API for this, but they don't have every google answers page indexed.

any ideas would be appreciated. i have been digging on php.net and i know the info is there somewhere. i have tried many different solutions for each of these dilemmas with no luck.

thanks in advance,

tj

georgiecasey

1)
So you get a scraped title like this: How to find porn on the net. What I'd do is str_replace all \r\n and \n with a <br /> and insert that into the database. Also, I think it's addslashes() you need to escape everything for mysql.
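A minimal sketch of one way to handle the line breaks from question 1, assuming the scraped text is UTF-8 and that the line breaks on the source page are <br> tags or plain newlines. The function names cleanScrapedText() and renderForHtml() are made up for illustration. Instead of storing <br /> in the database, this variant stores plain newlines and only converts them to HTML at display time, which keeps the stored text reusable when posting to other sites:

```php
<?php
// Hypothetical helpers; the names are illustrative and not from the thread.

// Clean scraped markup before storing it: keep line breaks, drop tags,
// decode entities such as &amp; and unify the newline style.
function cleanScrapedText(string $raw): string
{
    // Turn <br>, <br/> and <br /> into real newlines before tags are stripped.
    $text = preg_replace('/<br\s*\/?>/i', "\n", $raw);
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Normalize Windows (\r\n) and bare \r line endings to \n.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    return trim($text);
}

// Convert the stored plain text to HTML only when it is displayed.
function renderForHtml(string $stored): string
{
    return nl2br(htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'));
}
```

The cleaned string can then be written to the db as-is; escaping for SQL is a separate step and should not change what ends up stored.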

2)
Again, addslashes() would escape ? for mysql. Try that function instead of mysql_real_escape_string.

3)
The way I'd do it (which is the lazy and inefficient way) is to just place an unique index on the title field in the mysql table so you get no dupes.
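A rough sketch of the unique-index idea combined with bound parameters, offered as an alternative to escaping strings by hand. With bound parameters a "?" inside the data is just data; one possible cause of the error in question 2 (an assumption, the wrapper code isn't shown) is a wrapper that treats "?" as a placeholder. The table name, column names and insertQuestion() are hypothetical, and the sketch uses PDO with an in-memory SQLite database only so that it runs standalone. On MySQL the matching statement would be INSERT IGNORE INTO ...:

```php
<?php
// In-memory SQLite keeps the sketch self-contained; the thread itself uses MySQL.
// Table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL UNIQUE,
    answer TEXT
)');

// Returns true if the row was stored, false if the title was already present.
function insertQuestion(PDO $pdo, string $title, string $answer): bool
{
    // The unique index on title makes the database reject duplicates;
    // OR IGNORE turns the rejection into a silent skip instead of an error.
    $stmt = $pdo->prepare(
        'INSERT OR IGNORE INTO questions (title, answer) VALUES (?, ?)'
    );
    // Values are bound, not concatenated, so quotes and "?" need no escaping.
    $stmt->execute([$title, $answer]);
    return $stmt->rowCount() === 1;
}

$first  = insertQuestion($pdo, "What's the best way to escape a '?'", 'Bind it as a parameter.');
$second = insertQuestion($pdo, "What's the best way to escape a '?'", 'Duplicate title.');
```

Here $first is true and $second is false, so the dupe check happens at write time with no separate lookup query.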

perkiset

@ #3: You'll need some kind of hashing algorithm to do it right. Perhaps something that walks the blob and takes every 20th char and puts it into a string as a key for the table... for example: Consider this email I scoffed:


I am glad to hear your trip went well. Ours
was also good, just far more walking and hiking
than I had planned, but I did come home to find
I had lost two pounds with no dieting. I had better
start hiking some mountains!

Let me know when would be best with your schedule
to meet since I do want to continue but just to map out
what we need to do over the next few months so I
think I would want to meet with Kelly also.

Call me soon to let me know your schedule,

... If I take every 20th char, I get something like this:
IuOj de is (you get the idea)

This key string is then placed into a field with a unique index... the processing to get that string is really easy... here's an example:

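// Build a short key from every 20th character of the text.
function getCharHash($inBuff)
{
	// Ignore case and spaces so trivial differences don't change the key.
	$inBuff = str_replace(' ', '', strtolower($inBuff));
	$out = array();
	$max = strlen($inBuff);
	// Sample the characters at positions 0, 20, 40, ...
	for ($i=0; $i<$max; $i+=20)
		$out[] = $inBuff[$i];
	return implode('', $out);
}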

This string will be 1/20th of the length of your total buffer and will give you a reasonably accurate look at your duplication. Note that in my little function I excluded spaces and lower-cased everything to be a bit more persnickety.
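One caveat with sampling every 20th char: two texts that differ only between the sampled positions get the same key. A possible alternative, sketched here and not what the function above does, is to hash the whole normalized text with md5(), which gives a fixed 32-character key that fits a unique index regardless of text length. The name getFingerprint() is made up for this example:

```php
<?php
// Hypothetical helper: a fixed-length fingerprint for duplicate detection.
function getFingerprint(string $text): string
{
    // Lower-case and drop all whitespace so spacing and line-break
    // differences between two scrapes of the same text don't matter.
    $normalized = preg_replace('/\s+/', '', strtolower($text));
    // md5() always returns 32 hex characters.
    return md5($normalized);
}
```

Usage would be the same as with the key string: store the fingerprint in its own column with a unique index and let the insert fail or be skipped on a dupe.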

Just an idea. Good luck,
/p

timjohn

wow thanks perk and georgie. i had forgotten about this thread, really appreciate a quality response!

i haven't spent much time here, but it appears as if the forum is becoming more active, def a plus! i look forward to learning and contributing more.

timjohn

perkiset

No worries TJ -

In fact, I must have just clicked "Mark all as read" or something because I can't believe that I didn't respond earlier to your post.

Glad to be of help, hope you'll stick around more.

/p

Bompa

I just saw this thread now.

Anyways Dirk posted a way to accept only ASCII characters by the hex codes. It's easier than trying to eliminate the others. Not sure if it fits your situation tho.

It's in perl, but you get the idea:

$string =~ s/[^\x00-\x7E]//sg; # keep ASCII 0x00 - 0x7E (0 - 126)
http://www.perkiset.org/forum/perl_coding_best_practices/filtering_bad_characters-t372.0.html;msg2457#msg2457
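Since the rest of the thread is PHP, here is a sketch of roughly the same filter using preg_replace(). The function name stripNonAscii() is made up. Note that it works byte by byte, so a multi-byte UTF-8 character such as "é" is dropped entirely and not converted to "e":

```php
<?php
// Hypothetical helper; keeps only bytes 0x00-0x7E, like the Perl one-liner.
function stripNonAscii(string $s): string
{
    // Single quotes leave \x00 and \x7E for the regex engine to interpret.
    return preg_replace('/[^\x00-\x7E]/', '', $s);
}
```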

Bompa
