The Cache: Technology Expert's Forum
 
Topic: HTML Parsing?
DangerMouse
« on: August 14, 2007, 04:29:53 PM »

Hi all,

Just a quick question out of curiosity rather than based on a specific example at the moment: what method do you guys tend to use for HTML parsing?

I'm aware I can attempt to learn preg_match and regular expressions, but these seem to suffer from complexity and potential unreliability (depending on the content of the page and how often it changes). I've spotted a few classes, although none of them appears to have the full functionality I expected. Although HTML isn't always strict XML, I was expecting something that would allow me to 'walk' it in the same way, accessing elements, attributes and contents on both a name and parent-child basis.

Just thought I might have missed something obvious, or maybe I'm expecting it to be too 'easy'? (Although little is easy for a noob PHP coder like me!)

Cheers,

Steve


perkiset
« Reply #1 on: August 14, 2007, 04:38:06 PM »

It completely depends on what you are trying to do.

If you are simply trying to extract a couple of things from a page, and you know what they look like, the most effective way to do it is to learn to use regexes - then the preg_match and preg_match_all functions are REALLY handy. If it's something really small, the substr and strpos functions can be employed, but that will get out of hand quickly.
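
To make the two approaches above concrete, here is a rough sketch. It is written in Python rather than PHP purely so it can be run standalone; `re.findall` plays roughly the role of preg_match_all, and `str.find` with slicing roughly the role of strpos and substr. The sample markup, the URLs and the helper name `between` are made up for illustration.

```python
import re

# Hypothetical sample page; markup and URLs are invented for this sketch.
html = """
<html><body>
<h1>Price list</h1>
<a href="/widgets">Widgets</a>
<a class="ext" href='http://example.com/gadgets'>Gadgets</a>
</body></html>
"""

# Regex approach: capture every href value together with its link text.
link_re = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
links = link_re.findall(html)


# Substring approach: fine for one small, predictable fragment.
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


title = between(html, "<h1>", "</h1>")

print(links)
print(title)
```

The pattern only copes with quoted href values and links that do not nest other anchors, which is exactly the kind of assumption that makes regex extraction brittle when a page changes.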

Learning regex: try this:
http://www.regular-expressions.info/tutorial.html

... but I'd recommend buying the book because it's a great desk reference. I've had one on my desktop bookshelf for a couple of years and it rocks.

If you are trying to do more complicated extraction, i.e. something that either transcends simple divs or is just too complicated to regex around, then a C-string style "walker" is effective: you literally walk across the string char by char, start recording chars into another string when you reach the part you want, and turn recording off when you are done. This is the beginning of a state machine parser, which is far more complicated than you want to take on, but it is how browsers parse HTML. A full parser of that kind tracks the state of a variety of data points, like BOLD or ITALIC, or "in a row" then "in a cell" ... very complicated and hardly necessary unless you want to go toe-to-toe with Firefox.

And I'm thinking that doing such a thing in PHP would be bad... ;)
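
The simple char-by-char walker (not the full browser-grade parser) could look something like the sketch below. It is in Python for the sake of a runnable example; the function name `extract_tag_text` and the sample row are hypothetical. It keeps just two pieces of state: whether it is inside a tag, and whether recording is switched on.

```python
def extract_tag_text(html, target):
    """Walk html one character at a time and collect the text found inside
    each <target>...</target> pair. Assumes the target element is not nested
    inside itself and that no attribute value contains '>'."""
    target = target.lower()
    results = []
    in_tag = False      # currently between '<' and '>'
    recording = False   # currently inside the target element
    tag_buf = []        # characters of the tag being read
    text_buf = []       # characters of the text being recorded

    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
                parts = "".join(tag_buf).strip().lower().split()
                name = parts[0] if parts else ""
                if name == target:
                    recording = True        # switch recording on
                    text_buf = []
                elif name == "/" + target and recording:
                    recording = False       # switch recording off
                    results.append("".join(text_buf))
            else:
                tag_buf.append(ch)
        elif ch == "<":
            in_tag = True
            tag_buf = []
        elif recording:
            text_buf.append(ch)
    return results


sample = '<tr><td class="n">Alice</td><td>42 <b>years</b></td></tr>'
print(extract_tag_text(sample, "td"))
```

Tags that appear inside the target (the `<b>` here) are skipped while their text is still recorded, which is often what you want when scraping table cells.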

The worst thing with any form of "state" parsing is ill-formed HTML... someone opens a bold state and then forgets to shut it off... this is a silly example, but in the real world, if you're trying to do intelligent parsing of other people's work, you'll definitely come across it.
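
As an illustration of the unclosed-bold problem, the following sketch uses Python's standard `html.parser` module with a stack of open tags to report anything that was opened and never shut off. The class name `TagBalanceChecker`, the helper `check` and the shortened list of void elements are assumptions made for this example, not part of any library.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for the sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            self.problems.append(f"stray </{tag}>")
            return
        # Pop back to the matching opener, noting everything left open.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"<{open_tag}> never closed")


def check(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was opened and never closed.
    checker.problems.extend(
        f"<{t}> never closed" for t in reversed(checker.stack)
    )
    return checker.problems


print(check("<p>Some <b>bold text</p><p>Next paragraph</p>"))
```

Running a check like this before extraction at least tells you whether the state you think you are in can be trusted.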

Hope this helps... if you have a more specific target in mind post and let's take a crack at 'er...

/p

DangerMouse
« Reply #2 on: August 15, 2007, 03:53:23 AM »

That site looks really comprehensive, thanks Perk! Although it's sure to make my head explode lol :o

I had assumed that poorly written HTML would pose a problem; indeed, that's the flaw in the classes I've found so far. I just thought it was strange that I hadn't stumbled upon a method similar to the way JavaScript walks the DOM, or SimpleXML walks XML files.

Once again it seems that the long way round is most appropriate :)

Ta for the tips.

Steve

mrsdf
« Reply #3 on: August 15, 2007, 12:58:23 PM »

Regex is the way to go for most stuff.

If you're interested in the structure try tidy:
tidy.sourceforge.net
php.net/tidy

perkiset
« Reply #4 on: August 15, 2007, 01:27:59 PM »

Quote from DangerMouse: "I just thought it was strange that I hadn't stumbled upon a method similar to the way JavaScript walks the DOM, or SimpleXML walks XML files."

You bring up an interesting point. The JITKO worm did some interesting stuff with JS, and the notion of parsing a page using a browser and JS is not without merit. Consider, for example, that IE and FF have spent a LOT of time parsing HTML and dealing with ill-formed stuff, converting it into the DOM. As I'm sitting here, I'm thinking that a fantastic way to dissect a page entirely would be to have JS pull it down so the browser converts it to a DOM, then simply walk the DOM, export it as XML and AJAX it up to a server that handles storage. That way you would not need a state parser or much of anything - the browser would do the hard work for you. You would also have instant and perfect access to virtually any aspect of the page you wanted, in a way that steps outside the confines of normal text parsing.
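
The idea of letting a forgiving parser build the tree and then exporting it as XML can be tried without a browser. The sketch below swaps the browser for Python's standard `html.parser` and `xml.etree.ElementTree`, so it is a stand-in for the technique rather than the browser-plus-JS setup itself. The class `SloppyTreeBuilder`, the wrapper element `document` and the sample markup are invented for the example, and the recovery rule (an unclosed tag simply stays open until an ancestor closes) is far cruder than what IE or Firefox actually do.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SloppyTreeBuilder(HTMLParser):
    """Build an ElementTree from sloppy HTML, closing tags implicitly."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) arrive as None.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        node = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching opener; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                       # text after a child element
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:                                 # text before any child
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    builder = SloppyTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


messy = '<div id="main"><p>Hello <b>world<p>Second<br>line</div>'
root = html_to_tree(messy)

# The result is well-formed XML even though the input was not.
print(ET.tostring(root, encoding="unicode"))

# Walk by name and by parent-child relationship.
for p in root.iter("p"):
    print(p.tag, repr(p.text))
```

Once the tree exists, elements, attributes and contents can be reached by name or by position, with no text parsing of your own.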

Wow, holy fuck. Never thought of it that way. Goddammit I have work to do!!! And there you go spinning my gears! :D

Gonna have to think about that a bit... thanks for the idea!
/p

DangerMouse
« Reply #5 on: August 15, 2007, 04:02:00 PM »

lol glad to contribute, after a fashion ;) ! There's loads of merit in having a perfect XML file to work with, and taking advantage of the browser's hard work sounds like a great way of doing it.

Although the technique is way beyond me at present, I can't help being curious how a remote HTML file could be pulled in and accessed with JS. I've attempted accessing the DOM of a page within an iframe before and failed miserably; from what I read, I didn't think it was possible to manipulate the DOM of a remote page - although admittedly this probably wasn't the best thing to try in one of my first forays into JS!

Thinking about it, I think I just overcomplicated the process. I guess you would just need to walk what was printed out on the page: PHP could grab any HTML beforehand (should a remote page be desired ;) ) and just print it back out, together with the JS to do the rest?

Interesting! But it's probably wise for me to stick to and learn the traditional methods for now. The Tidy library looks promising, thanks mrsdf.

So much to learn so little time!

Steve

webprofessor
« Reply #6 on: August 30, 2007, 10:54:11 AM »

You guys already covered pretty much the only way to do it in PHP (via regular expressions). It's not a PHP solution, but if I have a lot of HTML parsing to do and I know it's likely to be invalid, I prefer using C# and running IE from it.

m0nkeymafia
« Reply #7 on: September 03, 2007, 08:23:05 AM »

Learn regex, it'll be the best thing you ever did. These guys helped me learn the fooker, so they can probably help you too.

Alternatively you could just curl the webpage in, then parse it using the PHP SAX parser. I'm not sure how tolerant it is of malformed HTML though, and you would need the page to conform well to XHTML.
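
The tolerance question is easy to probe with any SAX parser. The sketch below uses Python's standard `xml.sax` as a stand-in for the PHP one (an assumption: the two are similar event-driven XML parsers, but this says nothing definite about PHP's behaviour). The handler class `LinkHandler`, the helper `extract_links` and both sample pages are hypothetical. It collects links from well-formed XHTML and shows what happens when a single `<br>` is left unclosed.

```python
import xml.sax


class LinkHandler(xml.sax.ContentHandler):
    """Collect (href, text) for every <a> element in well-formed XHTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def startElement(self, name, attrs):
        if name == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def characters(self, content):
        if self._href is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(xhtml):
    handler = LinkHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.links


good = ('<html><body><p><a href="/one">One</a> and '
        '<a href="/two">Two</a></p></body></html>')
bad = '<html><body><p><a href="/one">One</a><br></p></body></html>'

print(extract_links(good))
try:
    extract_links(bad)
except xml.sax.SAXParseException as err:
    # A strict XML parser gives up at the first well-formedness error.
    print("parse failed:", err.getMessage())
```

With this parser, one sloppy tag is enough to abort the whole parse, so a SAX approach is only practical on pages that really are valid XHTML or that have been cleaned up first.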

However, regex won't be a problem if the site is a large "player", as they rarely change the HTML of their pages.