Thread: HTML Parsing?


DangerMouse

Hi all,

Just a quick question out of curiosity rather than based on a specific example at the moment - what method do you guys tend to use for HTML parsing?

Am aware I can attempt to learn preg_match and regular expressions, but these seem to suffer from complexity and potential unreliability (depending on the content on the page and how often it changes). I've spotted a few classes, although none of them appear to have the full functionality I expected. Although HTML isn't always strict XML, I was expecting something that would allow me to 'walk' it in the same way, accessing elements, attributes and contents on both a name and parent-child basis.

Just thought I might have missed something obvious, or maybe I'm expecting it to be too 'easy'? (although little is easy for a noob PHP coder like me!)

Cheers,

Steve

perkiset

It completely depends on what you are trying to do.

If you are simply trying to extract a couple of things from a page, and you know what they look like, the most effective way to do it is to learn to use regexes - and then the preg_match and preg_match_all functions are REALLY handy. If it's something really small then the substr and strpos functions can be employed, but that will get out of hand quickly.
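As a rough illustration of the preg_match / preg_match_all approach described above, here is a minimal sketch. The sample markup and the variable names are made up for the example, and the patterns assume double-quoted href attributes, so treat it as a starting point rather than a robust extractor.

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page.
$html = '<html><head><title>Example page</title></head><body>'
      . '<a href="/one.html">First</a> <a class="x" href="/two.html">Second</a>'
      . '</body></html>';

// preg_match: grab a single known item, here the page title.
$title = null;
if (preg_match('~<title>(.*?)</title>~is', $html, $m)) {
    $title = trim($m[1]);
}

// preg_match_all: grab every link's href and text.
// PREG_SET_ORDER returns one array per match: [full match, href, inner HTML].
$links = array();
if (preg_match_all('~<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>~is', $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $links[] = array(
            'href' => $match[1],
            'text' => strip_tags($match[2]), // drop any markup inside the link
        );
    }
}

print_r($title);
print_r($links);
```

The `s` modifier lets `.` span newlines and `i` ignores tag case; the non-greedy `(.*?)` stops at the first closing tag instead of the last one on the page.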

Learning regex: try this:
http://www.regular-expressions.info/tutorial.html

... but I'd recommend buying the book because it's a great desk reference. I've had one on my desktop bookshelf for a couple of years and it rocks.

If you are trying to do more complicated extraction, i.e. something that either transcends simple divs or is just too complicated to regex around, then a C-string style "walker" is effective - you literally walk across the string char by char, and when you notice that you want to start recording chars into another string you do it... then you turn it off when you are done. This is the beginnings of a state machine parser, which is far more complicated than you want to endeavor - but that is how HTML is parsed by browsers. This is C-string oriented and understands the states of a variety of data points, like BOLD or ITALIC or "In a row" then "In a cell" ... very complicated and hardly necessary unless you want to go toe-to-toe with Firefox.
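The char-by-char "walker" idea can be sketched roughly like this. The function name walkExtract and the sample table are invented for illustration, and the sketch is deliberately naive: it assumes no `>` inside attribute values, no comments, and no nesting of the wanted tag inside itself.

```php
<?php
// Walk the string one character at a time and record the text that
// appears inside elements with the given (lowercase) tag name.
function walkExtract($html, $tagName)
{
    $results    = array();
    $inTag      = false; // state: currently between '<' and '>'
    $recording  = false; // state: currently inside the wanted element
    $tagBuffer  = '';    // characters of the tag being read
    $textBuffer = '';    // characters recorded for the current element
    $length     = strlen($html);

    for ($i = 0; $i < $length; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;
            $tagBuffer = '';
        } elseif ($ch === '>' && $inTag) {
            $inTag = false;
            // First word of the tag, lowercased, e.g. "td" or "/td".
            $parts = preg_split('/\s+/', trim($tagBuffer));
            $name  = strtolower($parts[0]);
            if ($name === $tagName) {
                $recording  = true;   // turn recording on
                $textBuffer = '';
            } elseif ($name === '/' . $tagName && $recording) {
                $recording = false;   // turn recording off and keep the text
                $results[] = trim($textBuffer);
            }
        } elseif ($inTag) {
            $tagBuffer .= $ch;
        } elseif ($recording) {
            $textBuffer .= $ch;
        }
    }
    return $results;
}

$cells = walkExtract('<table><tr><td>One</td><td><b>Two</b> words</td></tr></table>', 'td');
print_r($cells);
```

Only two states are tracked here ("in a tag" and "recording"). Note that an element which is opened but never closed simply never emits its text, which is the kind of trouble ill-formed markup causes for any state-based approach.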

And I'm thinking that doing such a thing in PHP would be bad...

The worst thing with any form of "state" parsing is ill-formed HTML... someone opens a bold state and then forgets to shut it off... this is a silly example, but in the real world if you're trying to do intelligent parsing of other people's work you'll definitely come across it.

Hope this helps... if you have a more specific target in mind post and let's take a crack at 'er...

/p

DangerMouse

That site looks really comprehensive, thanks Perk! Although it's sure to make my head explode lol

I had assumed that poorly written HTML would pose a problem; indeed that's the flaw I've found with the classes I've come across so far. I just thought it was strange that I hadn't stumbled upon a method similar to the way JavaScript walks the DOM, or SimpleXML walks XML files.

Once again it seems that the long way round is most appropriate

Ta for the tips.

Steve

mrsdf

Regex is the way to go for most stuff.

If you're interested in the structure try tidy:
tidy.sourceforge.net
php.net/tidy
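A minimal sketch of how tidy might be used to turn messy HTML into something an XML walker can handle. It assumes the tidy extension is enabled in your PHP build (it often is not by default); the function name repairHtml and the sample markup are made up for the example.

```php
<?php
// Repair ill-formed HTML into XHTML using the tidy extension.
// Returns null when tidy is not available or the repair fails.
function repairHtml($html)
{
    if (!function_exists('tidy_repair_string')) {
        return null; // assumption: ext-tidy may not be installed
    }
    $config = array(
        'output-xhtml'     => true,  // emit well-formed XHTML
        'numeric-entities' => true,  // e.g. &nbsp; becomes &#160; so XML parsers accept it
        'show-warnings'    => false,
    );
    $repaired = tidy_repair_string($html, $config, 'utf8');
    return is_string($repaired) ? $repaired : null;
}

// Hypothetical messy input: unclosed <b> and <p>, missing </html>.
$messy = '<html><body><p>Unclosed <b>bold<p>Second paragraph</body>';
$xhtml = repairHtml($messy);

if ($xhtml !== null) {
    // Once the markup is well-formed, SimpleXML can load it for walking.
    $xml = simplexml_load_string($xhtml);
    if ($xml !== false) {
        echo $xml->getName(), "\n"; // name of the root element
    }
}
```

How tidy chooses to close the dangling tags is its own decision, so check the repaired output before relying on a particular structure.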

perkiset

Quote from DangerMouse:

"I just thought it was strange that I hadn't stumbled upon a method similar to the way JavaScript walks the DOM, or SimpleXML walks XML files."

You bring up an interesting point. The JITKO worm did some interesting stuff with JS, and the notion of parsing a page using a browser and JS is not without some merit. Consider, for example, that IE and FF have spent a LOT of time parsing HTML and dealing with ill-formed stuff, converting it into the DOM. As I'm sitting here, I'm thinking that a fantastic way to dissect a page entirely would be to have JS pull it down, let the browser convert it to DOM, then simply walk the DOM, exporting it as XML and AJAXing it up to a server that handles storage. That way, you would not need a state parser or much of anything - the browser would do the hard work for you. You would also have instant and perfect access to virtually any aspect of the page you wanted, in a way that steps outside the confines of normal text parsing.

Wow, holy fish. Never thought of it that way. Goddammit I have work to do!!! And there you go spinning my gears!

Gonna have to think about that a bit... thanks for the idea!
/p

DangerMouse

lol glad to contribute, after a fashion! There's loads of merit in having a perfect XML file to work with, and taking advantage of the browser's hard work sounds like a great way of doing it.

Although the technique is way beyond me at present, I can't help being curious how a remote HTML file could be pulled in and accessed with JS? I've attempted accessing the DOM of a page within an iframe before, for example, and failed miserably. From what I read, I didn't think it was possible to manipulate the DOM of a remote page - although admittedly this probably wasn't the best thing to try with one of my first forays into JS!

Thinking about it, I think I just overcomplicated the process. I guess you would just need to walk what was printed out on the page: PHP could grab any HTML (should a remote page be desired) beforehand and just print it back out, together with the JS to do the rest?

Interesting! But it's probably wise for me to stick to and learn the traditional methods for now. The Tidy library looks promising, thanks mrsdf.

So much to learn, so little time!

Steve

webprofessor

You guys already covered pretty much the only way to do it in PHP (via regular expressions). It's not a PHP solution but... if I have a lot of HTML parsing to do and I know it's likely going to be invalid, I prefer using C# and then running IE from it.

m0nkeymafia

Learn regex, it'll be the best thing you ever did. These guys helped me learn the fooker so they can probably help you too.

Alternatively you could just curl the webpage in, then parse it using the PHP SAX parser. Not sure how tolerant it is of malformed HTML though, and you would need the page to conform well to XHTML.
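For reference, a small sketch of the event-based (SAX-style) route using PHP's xml_parser functions. The function name collectHrefs and the sample strings are invented for the example; it assumes the page has already been fetched, that it is well-formed XHTML, and that closures are available (PHP 5.3 or later).

```php
<?php
// Collect href attributes from <a> elements with PHP's SAX-style XML parser.
// Returns array(list of hrefs, error message or null).
function collectHrefs($xhtml)
{
    $hrefs  = array();
    $parser = xml_parser_create('UTF-8');
    // Keep element and attribute names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($p, $name, $attrs) use (&$hrefs) {
            if ($name === 'a' && isset($attrs['href'])) {
                $hrefs[] = $attrs['href'];
            }
        },
        // Called for every closing tag; not needed for this sketch.
        function ($p, $name) {
        }
    );
    $ok    = xml_parse($parser, $xhtml, true);
    $error = $ok ? null : xml_error_string(xml_get_error_code($parser));
    return array($hrefs, $error);
}

// Well-formed input parses cleanly.
list($hrefs, $error) = collectHrefs(
    '<html><body><a href="/a.html">A</a><p><a href="/b.html">B</a></p></body></html>'
);

// An unclosed <p> makes the parser stop with an error.
list($partial, $badError) = collectHrefs(
    '<html><body><a href="/a.html">A</a><p></body></html>'
);
```

The second call shows the tolerance problem: the parser gives up at the first well-formedness error, so anything after that point in the page is never seen.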

However, regex won't be a problem if the site is a large "player", as they rarely change the HTML of their pages.
