
timjohn
hello all,
i am having a bit of a hang-up on a couple of points here. i have built a scraper that scrapes google answers for questions in 120 different niches. i am scraping the first 100 results in google and writing them to a db. a couple of questions:

1.) i am scraping the title, category, question, and answer. i am doing strip_tags() and mysql_real_escape_string() on each string as i assign it to an array. however, how should i go about getting the text formatted properly? currently, when displayed, it doesn't break at the correct spots and instead has "\r\n" or "\n". when i don't use the mysql escape function it still doesn't display properly, however each string is sans any carriage return or newline characters or html encoded "&...." characters. how can i retain this formatting when writing to a mysql db? (note: i am going to be pulling these and posting to similar q and a sites)

2.) i am getting a sql error on the very first title it tries to insert... i believe this is caused by a "?" character. what is most likely causing this and how do i avoid it? i thought it was the previously mentioned escaping function... i have played with my mysql wrapper functions in many other instances and they have not caused me any problems.

3.) as far as removing the same entries i have recorded twice, what is the best / most efficient way to go about this? is it best to check to see if the same question has already been scraped before i write to the db, or after the fact? with 120 niches, an average of 5 keywords per niche, that is 60k questions to be scraped. obv i think i am going to have to break this up into a handful of runs so that it doesn't appear i am a bot to google (i am using curl wrapped in a simple class). i am also considering using a yahoo API for this, but they don't have every google answers page indexed.

any ideas would be appreciated. i have been digging on php.net and i know the info is there somewhere.
i have tried many different solutions for each of these dilemmas with no luck.
thanks in advance, tj

georgiecasey
1) So you get a scraped title like this: How to find porn on the net. What I'd do is str_replace all "\r\n" and "\n" with a <br /> and insert that into the database. Also, I think it's addslashes() you need to escape everything for mysql.
2) Again, addslashes() would escape ? for mysql. Try that function instead of mysql_real_escape_string.
3) The way I'd do it (which is the lazy and inefficient way) is to just place a unique index on the title field in the mysql table so you get no dupes.

perkiset
@ #3: You'll need some kind of hashing algorithm to do it right. Perhaps something that walks the blob and takes every 20th char and puts it into a string as a key for the table... for example: Consider this email I scoffed:
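The sampling idea in #3 could look roughly like the sketch below. It is only one reading of "takes every 20th char", and the function names `sampled_key` and `normalized_hash` are made up for illustration. The second function is a swapped-in alternative, not something suggested in the thread: it hashes the whole text after normalizing case and whitespace, which trades a slightly more expensive key for fewer accidental collisions.

```php
<?php
// Hypothetical helper: build a dedup key by taking every $step-th byte
// of the text (the 20th, 40th, 60th, ...). Two different texts can still
// produce the same key, so treat a match as "probably a duplicate".
function sampled_key(string $text, int $step = 20): string
{
    if ($step < 1) {
        throw new InvalidArgumentException('step must be at least 1');
    }
    $key = '';
    $len = strlen($text); // byte length; multibyte characters may be split
    for ($i = $step - 1; $i < $len; $i += $step) {
        $key .= $text[$i];
    }
    return $key;
}

// Alternative (an assumption, not from the thread): collapse whitespace,
// lowercase, and hash the whole string. The result is a fixed 32-character
// hex key that fits a unique index column.
function normalized_hash(string $text): string
{
    $normalized = strtolower(trim(preg_replace('/\s+/', ' ', $text)));
    return md5($normalized);
}
```

Either key could go in its own column with a unique index, so the check happens at insert time rather than in a cleanup pass afterwards.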
timjohn
wow thanks perk and georgie. i had forgot about this thread, really appreciate a quality response!
i haven't spent much time here, but it appears as if the forum is becoming more active, def a plus! i look forward to and contributing more.
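A note on questions 1 and 2, offered as a sketch rather than a definitive fix. mysql_real_escape_string() turns real newlines into backslash sequences for use inside a SQL string literal, so displaying the escaped string directly would show a literal "\r\n". "?" is not a special character inside a MySQL string literal, so the error more plausibly comes from a quote character or from a wrapper that treats "?" as a placeholder, though that is a guess. The mysql_* functions were removed in PHP 7. One common approach is to store the clean text with real newlines, let a prepared statement handle quoting, and convert newlines to <br /> only when rendering. The function names and the `questions` table below are hypothetical.

```php
<?php
// Hypothetical helper: reduce scraped HTML to plain text with real newlines.
function clean_scraped_text(string $html): string
{
    // Turn <br> and </p> into newlines first so line breaks survive strip_tags().
    $text = preg_replace('/<br\s*\/?>|<\/p>/i', "\n", $html);
    $text = strip_tags($text);
    // Decode entities such as &amp; and &quot; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    return trim($text);
}

// Hypothetical helper: escape for HTML output, then add <br /> at newlines.
function render_for_html(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

// Sketch of an insert through PDO; the table and column names are assumed.
// Bound parameters need no addslashes() or manual escaping.
function insert_question(PDO $db, string $title, string $question): void
{
    $stmt = $db->prepare('INSERT INTO questions (title, question) VALUES (?, ?)');
    $stmt->execute([$title, $question]);
}
```

Storing the text unformatted and calling something like render_for_html() at display time keeps the database copy reusable when the same answers are posted to other sites.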
