The Cache: Technology Expert's Forum
 
Topic: stupid regex question: all or nothing
ksan
« on: October 24, 2008, 01:37:49 AM »

Hey guys,

This might be a totally stupid thread, but please bear with me; I did not find anything related on the net.
I have built a bot that scrapes some selected sites to gather information for content rewriting.
I am relatively new to regex but have had great success with it.
However, there seems to be a logical problem I can't wrap my head around. My regexes either have no appetite at all, or they're the greediest bunch around.
Example site code:
<div id="content">
<p>blabla content bla</p>
</div>
<div id="nextcontainer">
<p>blabla</p>
</div>

So in order to extract everything inside the content div, my pattern would be '/id\=\"content\"\>(.*)\<\/div\>/' .
Well, this does not work, because it only searches in the first line. OK, so a quick insertion of \s and it looks like: '/id\=\"content\"\>(.\s*)\<\/div\>/' . However, this does not work either. No matter where I place the \s, I don't get a different result. What works is this: '/id\=\"content\"\>(.*)\<\/div\>/s' .
This will, however, match everything from the start of the content div to the very last </div>.
OK, so wouldn't I need to specify how often a </div> can be included? Thought so, but neither
'/id\=\"content\"\>(.*)(\<\/div\>)?/s' nor '/id\=\"content\"\>(.*)(\<\/div\>){1}/s' is working.
So in the end, the only working solution for me is this: '/id\=\"content\"\>(.*)id\=\"nextcontainer\"\>/s'

Would be awesome if someone could point out where my logic is wrong and explain why the patterns can't work.
By the way, I am running PHP 5.2.6 and I am using the preg_match_all function.

Bompa
« Reply #1 on: October 24, 2008, 04:44:54 AM »

Hey Ksan, your question is not stupid, but sure is confusing.

Firstly, you don't need to escape = or " or >

If you want only the content of the first div...
Code:
$string =~ /<div(.*?)<\/div>/s;
print "$1\n";

That's in perl, but you get the idea.

The main difference is the ? after the *. It makes the match lazy: "match as little as possible, up to the first occurrence of whatever follows".
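For the PHP side, here is a minimal sketch of the same idea with preg_match. It uses the sample markup from the first post, with the second div's opening tag written correctly (an assumption about what was intended). The variable names are made up for illustration.

```php
<?php
// Sample markup from the first post (second div tag assumed to be well-formed).
$html = <<<HTML
<div id="content">
<p>blabla content bla</p>
</div>
<div id="nextcontainer">
<p>blabla</p>
</div>
HTML;

// Greedy: .* runs to the end of the input, then backtracks to the LAST </div>.
preg_match('#id="content">(.*)</div>#s', $html, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </div>.
preg_match('#id="content">(.*?)</div>#s', $html, $lazy);

echo trim($lazy[1]), "\n";   // <p>blabla content bla</p>
```

The /s modifier is still needed in both cases, because without it the dot does not match newlines.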

« Last Edit: October 24, 2008, 04:56:34 AM by Bompa »
ksan
« Reply #2 on: October 24, 2008, 05:08:23 AM »

Ah cool, thanks Bompa.
I thought ? meant that the group or character in front of it can occur only once.
So what if I want everything before the third </div> tag?
Is it /<div(.*){3}</div>/s ?
Bompa
« Reply #3 on: October 24, 2008, 07:09:19 AM »


It doesn't work that way and it's not going to be easy to do.

Bompa
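To spell out why the {3} idea fails: {3} repeats the group (.*) itself, and the first greedy .* already swallows everything, so the other two repetitions match nothing. Nothing in that pattern counts closing tags. For flat markup with no nested divs, one possible workaround is to repeat a lazy "up to and including </div>" group twice and then match lazily up to the third. This is only a sketch on a made-up sample string; it has no idea about nesting, which is where it stops being easy.

```php
<?php
// Made-up flat sample: four sibling divs, no nesting.
$html = '<div id="a">one</div><div id="b">two</div>'
      . '<div id="c">three</div><div id="d">four</div>';

// (?:.*?</div>){2} lazily consumes up to and including the first two closing tags,
// then .*? runs up to (but not including) the third one.
preg_match('#<div((?:.*?</div>){2}.*?)</div>#s', $html, $m);

echo $m[1], "\n";
```

The capture ends right before the third </div>, so it contains exactly two closing tags. Note the # delimiters: with / as the delimiter, the slash in </div> would have to be escaped.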
perkiset
« Reply #4 on: October 24, 2008, 09:01:09 AM »

ksan if you're trying to get "all the div content" then perhaps you might consider a completely different approach.

Consider this:

http://us.php.net/dom

... the DOM extension for PHP will load an HTML file and parse it for you. Then you either walk the tree or pull out the elements you want with getElementsByTagName or what have you. I guess my point is: if you're looking for a single little thing in an HTML file, regex is hot. If you're looking to tear apart and interpret an HTML file, then perhaps the DOM is the way to go.
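A rough sketch of what that could look like for the markup in the first post, assuming the page is already in a string (here a hard-coded sample, wrapped in html/body tags for the parser). It shows two routes: asking for the one div by id through XPath, and walking all divs with getElementsByTagName.

```php
<?php
// Hard-coded stand-in for a fetched page.
$html = '<html><body><div id="content"><p>blabla content bla</p></div>'
      . '<div id="nextcontainer"><p>blabla</p></div></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

// Option 1: ask for the one div directly.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="content"]');
$text  = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
echo $text, "\n";                   // blabla content bla

// Option 2: walk every div on the page.
$ids = array();
foreach ($doc->getElementsByTagName('div') as $div) {
    $ids[] = $div->getAttribute('id');
}
echo implode(', ', $ids), "\n";     // content, nextcontainer
```

textContent returns only the text inside the node, so the closing-tag guessing that the regex versions struggle with does not come up at all.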
nutballs
« Reply #5 on: October 24, 2008, 09:26:05 AM »

dom will only work though if the HTML is valid as XML. most online is not valid HTML let alone XML, because most webmasters are fucktards.

The ? after a quantifier like * or + makes that quantifier non-greedy (lazy): it matches as little as possible instead of as much as possible.
So with something like this:
<div>test</div><div>test2</div>

#<div>(.*)</div>#s   will capture:   test</div><div>test2

#<div>(.*?)</div>#s   will capture:   test   with preg_match, or an array with test and test2 in it with preg_match_all


BTW, I use # symbols instead of / as my delimiters so I don't have to escape slashes.
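As a quick runnable check of the two patterns above with preg_match_all (the variable names are only for illustration):

```php
<?php
$html = '<div>test</div><div>test2</div>';

// Greedy: a single match that spans both divs.
preg_match_all('#<div>(.*)</div>#s', $html, $greedy);

// Lazy: one match per div.
preg_match_all('#<div>(.*?)</div>#s', $html, $lazy);

print_r($greedy[1]);   // one element: test</div><div>test2
print_r($lazy[1]);     // two elements: test, test2
```

Index 1 of the result holds the contents of the first capture group for every match; index 0 would hold the full matches including the tags.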

The other option, frankly, is to not do it the way you are trying, which is with one regex in one shot. You can't, for a couple of reasons.
I assume that the target content is "not predictable" in its structure? If it is predictable, then you can make a single pattern. If not, you will probably need to go multistep. Realize that you could accomplish the exact same thing by removing stuff.
preg_replace is your friend.

Code:

// keep only what comes before the closing </body> tag
$arr = preg_split('#</body.*?>#is', $fullhtmlpage);
$a = $arr[0];
// delete these tags together with their contents
$a = preg_replace('#<style.*?>.*?</style>#is', ' ', $a);
$a = preg_replace('#<script.*?>.*?</script>#is', ' ', $a);
$a = preg_replace('#<form.*?>.*?</form>#is', ' ', $a);
// delete comments, both raw and entity-encoded
$a = preg_replace('#<!--.*?-->#is', ' ', $a);
$a = preg_replace('#&lt;!--.*?--&gt;#is', ' ', $a);
// delete these tags but keep whatever they wrap
$a = preg_replace('#<img.*?>#is', ' ', $a);
$a = preg_replace('#<br\s*/?>#is', ' ', $a); // <br>, <br/> and <br />
$a = preg_replace('#</?p.*?>#is', ' ', $a);
$a = str_replace('&nbsp;', ' ', $a);
$a = preg_replace('#<.+?>#si', ' ', $a); // strip remaining tags, like custom XML
// collapse every run of whitespace into a single space
$a = preg_replace('#\s+#', ' ', $a);
$pileofsentencesforyoutobreakaparthowyouwish = trim($a);

perkiset
« Reply #6 on: October 24, 2008, 10:45:34 AM »

Quote from nutballs:
dom will only work though if the HTML is valid as XML. most online is not valid HTML let alone XML, because most webmasters are fucktards.
Sorry Nuts, not true:

http://us.php.net/manual/en/domdocument.loadhtmlfile.php

DOMDocument's loadHTMLFile will load an ill-formed HTML file, rather than insisting on well-formed XML, quite effectively. It's handy, but I don't know what kind of overhead it imposes, or where the "doesn't break on badly formed" line actually is. But that's the way they sell the function.
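One way to get a feel for where that line is would be to have libxml collect its complaints and look at them after loading. A small sketch, using loadHTML on a deliberately broken, made-up string instead of loadHTMLFile so it runs without a file on disk:

```php
<?php
// Deliberately broken markup: an unclosed <p> and a stray </span>.
$html = '<html><body><div id="content"><p>first<p>second</span></div></body></html>';

libxml_use_internal_errors(true);    // buffer parser errors instead of emitting warnings
$doc = new DOMDocument();
$ok  = $doc->loadHTML($html);

$problems = libxml_get_errors();     // array of LibXMLError objects
libxml_clear_errors();

printf("loaded: %s, parser complaints: %d\n", $ok ? 'yes' : 'no', count($problems));
foreach ($problems as $p) {
    echo '  line ', $p->line, ': ', trim($p->message), "\n";
}

// The repaired tree is still usable.
foreach ($doc->getElementsByTagName('p') as $para) {
    echo trim($para->textContent), "\n";
}
```

How many complaints show up, and how the parser repairs the tree, depends on the libxml version, so the count is something to inspect rather than rely on.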
nutballs
« Reply #7 on: October 24, 2008, 10:49:57 AM »

That's all I meant, though I said it a bit too aggressively. It "will work", though it becomes more and more unpredictable as you introduce more errors.
vsloathe
« Reply #8 on: October 24, 2008, 11:05:58 AM »

Yep right tool for the job. PHP's DOM is the balls in general.
perkiset
« Reply #9 on: October 24, 2008, 11:35:33 AM »

VS have you done any scoping to find out how heavy the DOM load functions are? It'd be great to know just how much you pay for DOM over regexing for a chunk...
vsloathe
« Reply #10 on: October 24, 2008, 11:40:36 AM »

PHP's DOM is lighter weight than its tacked-on implementation of the PCRE, in general.

It's not really the calls to the PCRE, it's the starting up of the PCRE module/interpreter.
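For anyone who wants numbers from their own box, here is a crude timing sketch. The generated page, the repeat count and the variable names are all invented for the test, and a single run like this says nothing general. It only shows one way to compare the two approaches on the same input.

```php
<?php
// Synthetic page: 2000 identical rows (invented test data).
$html = '<html><body>'
      . str_repeat('<div class="row"><p>some text</p></div>', 2000)
      . '</body></html>';

libxml_use_internal_errors(true);

// Regex route: pull every paragraph body.
$t0 = microtime(true);
preg_match_all('#<p>(.*?)</p>#s', $html, $m);
$regexSeconds = microtime(true) - $t0;

// DOM route: parse the page, then count the <p> nodes.
$t0 = microtime(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$domCount = $doc->getElementsByTagName('p')->length;
$domSeconds = microtime(true) - $t0;

printf("regex: %d matches in %.5f s\n", count($m[1]), $regexSeconds);
printf("dom:   %d nodes in %.5f s\n", $domCount, $domSeconds);
```

Results will vary with the PHP version, the libxml and PCRE builds, and the shape of the real pages, so it is worth running against actual scraped content before drawing conclusions.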
nutballs
« Reply #11 on: October 24, 2008, 01:05:29 PM »

I am assuming the case we are talking about here is scraping.
trying to get all the content from a page, which may or may not have been built by a retarded monkey.

In that case, I'm not sure how the DOM would work. Though I guess you could walk the tree grabbing the content of each node. But I'm not sure how that works when you have something like this:

<div>This is some content<p>with more<div>nested junk</div> inside the </p> outer divs.</div>
OR
<div>this is some<p>content that is</div>badly formed crazy nested</p>


Would I be able to easily grab all the content that way? I would love to see an example of how, just a simple one.
That's why I posted my regex stripper example.
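A simple sketch of one way this might go, using the two snippets above as input. The helper name visible_text is made up, and wrapping the fragment in html/body tags is an assumption to give the parser a full document. The idea is to let the parser repair the nesting however it likes and then read textContent, which concatenates all text nodes regardless of the tree shape.

```php
<?php
$samples = array(
    '<div>This is some content<p>with more<div>nested junk</div> inside the </p> outer divs.</div>',
    '<div>this is some<p>content that is</div>badly formed crazy nested</p>',
);

libxml_use_internal_errors(true);   // keep the parser's complaints quiet

// Hypothetical helper: returns all text found below <body>, whitespace collapsed.
function visible_text($html) {
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';
    return trim(preg_replace('#\s+#', ' ', $text));
}

foreach ($samples as $s) {
    echo visible_text($s), "\n";
}
```

Where two elements meet without any whitespace in the source, the words on either side end up glued together, so a real scraper would probably walk the text nodes and add its own separators.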
perkiset
« Reply #12 on: October 24, 2008, 01:10:42 PM »

nuts -at lunch mtg, will post example in hour or 2

Sent from my iPhone
nutballs
« Reply #13 on: October 24, 2008, 01:13:28 PM »

meh no biggie. just curious.
ksan
« Reply #14 on: October 25, 2008, 08:21:08 AM »

Wow, thanks to all for your feedback.
I'll definitely look into DOM.
As far as removing everything with preg_replace goes, wouldn't it be easier/faster to just use strip_tags() ?
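One thing to watch for with strip_tags(): it removes the tags but keeps whatever text sat between them, so script and style bodies land in the output. A small sketch on a made-up string, comparing the plain call with a version that drops those blocks first:

```php
<?php
// Made-up sample with inline script and style blocks.
$html = '<p>Hello <b>world</b></p><script>var x = 1;</script><style>p { color: red; }</style>';

// Plain strip_tags(): the tags go, the script and style bodies stay.
$naive = strip_tags($html);

// Drop script/style blocks including their contents, then strip the rest
// and collapse whitespace.
$clean = preg_replace('#<(script|style)\b.*?>.*?</\1>#is', ' ', $html);
$clean = trim(preg_replace('#\s+#', ' ', strip_tags($clean)));

echo $naive, "\n";
echo $clean, "\n";   // Hello world
```

So strip_tags() could stand in for the tag-stripping lines of a multistep cleanup, but the "delete these tags together with their contents" steps would still be needed.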