The Cache: Technology Expert's Forum
 
Topic: Backing up/Scraping SMF Forum (Read 308 times)
sassy bear
« on: August 01, 2008, 11:34:33 AM »

Does anyone have experience they can share on how I would go about scraping an SMF forum?
perkiset
« Reply #1 on: August 01, 2008, 12:52:14 PM »

You said "backup" or "scrape"... If you want to back up your own forum, that's easier than scraping someone else's, because there are so many different themes out there. Scraping isn't that bad, though, if you've really got your heart set on it.

Backing up: as an admin you have a database backup (SQL export) option, and that'll do you. Alternatively, use phpMyAdmin to export the database to a compressed SQL dump and you're all good.

If you're scraping then it's a completely different conversation... ;)

If I can't be Mr. Root then I don't want to play.
sassy bear
« Reply #2 on: August 01, 2008, 05:23:50 PM »

Okay, thanks. The SQL export is an option, but I was thinking more of a robotic, scraping type of backup -> the different conversation ;)

.... Maybe triggered off of the RSS feed or something. To keep it simple, let's just say I wanted to scrape a flat URL with LAMP, and just store all the HTML in a VARCHAR field.... How would I go about it?

Are there scrapers that accept cookies, etc., as if you were a browser, the way the Internet Explorer ActiveX control used to do?

vsloathe
« Reply #3 on: August 01, 2008, 05:26:55 PM »

Well, a cookie is just some text sent in the HTTP headers. It's the browser that puts it in a file. You could just make your table store the cookies as well.

I would set the type of the field that will store the HTML to TEXT rather than VARCHAR. You'll want to do a lot of escaping, and not rely on magic quotes or any other pseudo-security shenanigans... or maybe that's just me that would want to do that. ROFLMAO
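
To make the storage advice concrete, here is a minimal sketch of saving scraped HTML without hand-rolled escaping. It is written in Python with the standard-library `sqlite3` module standing in for MySQL, purely so it runs on its own; the table name `pages`, its columns, and the helper names are assumptions for illustration, not anything from this thread. The point is the `?` placeholders: the driver handles quoting, so nothing depends on magic quotes.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny page store. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " cookies TEXT,"
        " html TEXT)"
    )
    return conn


def save_page(conn, url, html, cookies=""):
    """Insert or overwrite one page. Values are bound as parameters, never
    pasted into the SQL string, so quotes in the HTML cannot break the query."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, cookies, html) VALUES (?, ?, ?)",
        (url, cookies, html),
    )
    conn.commit()


def load_page(conn, url):
    """Return the stored HTML for a URL, or None if it was never saved."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```

The same pattern should carry over to MySQL through any driver that supports bound parameters; only the connection call and the placeholder style would change.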

perkiset
« Reply #4 on: August 01, 2008, 05:40:23 PM »

Quote from: sassy bear
"To keep it simple, let's just say I wanted to scrape a flat URL with LAMP, and just store all the HTML in a VARCHAR field.... How would I go about it?"

Forget about a forum for just a moment and let's just talk about straight-up scraping. You'll use something like cURL, or file_get_contents, or the web request class here in the PHP code repository to simply grab a page, parse it the way you want to, then store it in a database. Scraping websites is much like scraping travel systems, as we did so many years ago.
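
As a rough illustration of the grab-then-parse flow described above, here is a minimal sketch in Python using only the standard library (the thread itself is about PHP and cURL, so treat this as the same idea in a different language, not as the web request class mentioned here). The function names, the User-Agent string, and the choice to pull out links are all assumptions for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_page(url, timeout=30):
    """Grab one page and return its body as text (needs network access)."""
    # The User-Agent value is a made-up example; some servers reject blank agents.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse a page and return absolute URLs for the links it contains."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawl would then call `fetch_page` on a starting URL, pass the result to `extract_links`, and repeat for the links worth following, storing each page as it goes.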

Quote from: sassy bear
"Are there scrapers that accept cookies, etc., as if you were a browser, the way the Internet Explorer ActiveX control used to do?"

VS is right: cookies are simply name-value pairs exchanged between a server and a client. They can be used for all sorts of purposes, not the least of which is remembering a session ID, as this forum or any retail site does. If you look at the web request class you'll see how I handle cookies, both sending them to a server and storing them locally. Cookies really don't have anything to do with scraping unless the server requires that you have a session ID when you request a page.
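
To show how little there is to a cookie, here is a small Python sketch (standard library only; the helper names and the `PHPSESSID` sample value are assumptions for illustration). The first two functions do by hand what a browser does: read the name-value pair out of a `Set-Cookie` header and echo it back in a `Cookie` header. The third builds an opener that does the same thing automatically across requests.

```python
from http.cookiejar import CookieJar
from http.cookies import SimpleCookie
from urllib.request import HTTPCookieProcessor, build_opener


def parse_set_cookie(header_value):
    """Turn the value of a Set-Cookie header into a {name: value} dict.
    Attributes such as path or expires are dropped; only the pair is kept."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


def build_cookie_header(cookies):
    """Build the value of a Cookie request header from a {name: value} dict."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def make_browserlike_opener():
    """Return (opener, jar). Requests made with opener.open() store cookies
    the server sets in jar and send them back on later requests."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar
```

For a site that needs a session ID, requesting the login or index page once through the opener and then reusing that same opener for the remaining pages is usually all the cookie handling a scraper needs.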