The Cache: Technology Expert's Forum
 
Author Topic: How to use regex to grab data from html file.  (Read 7180 times)
tommytx
Expert
****
Offline Offline

Posts: 123


View Profile WWW
« on: July 09, 2011, 11:05:28 AM »

Wow.. this thing is eating my lunch.. I hope one of you experts can help me out.. You have never failed before... I feel like I should put you guys on retainer, because any time I find something that eats my lunch, one of you simply tells me what I am doing wrong in just minutes.......

If it's too confusing to interpret, I can give as much detail as you like.. Basically, when my IDX sends me an email with the HTML data telling me I have a lead, I pipe it into a PHP file where I need to extract the important data... So yes, this extraction code needs to run in a PHP file that is fed by a piped-in email... Once I have grabbed the data I will format it into a PHP POST request to send to AWeber, Constant Contact and other places.... Thanks in advance for any advice.

Code:

Actual HTML FILE
****************
<html><b>Tom Chambers</b> (tom@myemail.com ) has signed up for a <a href="http://search.mydomain.com/idx/10159/userSignup.php">Listing Manager Account</a> on <a href="http://virginia-beach-home-for-sale.com/">http://virginia-beach-home-for-sale.com/</a>!<br /><br /><b>First Name</b>: Tom<br />
                                                <b>Last Name</b>: Chambers<br />
<b>Email</b>: tom@mydomain.com<br />
<b>Additional Email</b>: tom@mydomain.com<br /><b>Currently Selected office Agent</b>: The lead did not choose an office agent. <br />
                                                <b>Previously Assigned office Agent</b>: The lead is not assigned to an agent. <br /><b>Phone Number</b>: (757) 123-4567 <br />
<b>Address</b>: P.O. Box 14123 - Norfolk, VA 12345<br /><br /><br />As of this email, the last property detail page that this lead was viewing before signing up can be viewed by clicking <a href="http://search.virginia-beach-home-for-sale.com/idx/10159/details.php?idxID=000&listingID=73MXFC">this link</a><br /><br />Also, the last search that this lead performed can be recreated by clicking <a href="http://search.virginia-beach-home-for-sale.com/idx/10159/results.php?stp=basic&pt=sfr&showField=cityField&lp=200000&hp=800000&ba=0&srt=DESC&start=0&per=10&cid=10159">this link</a></html>



Grab this stuff out of the junk
*******************************
<b>First Name</b>: Tom<br />
<b>Last Name</b>: Chambers<br />
<b>Email</b>: tom@mydomain.com<br />
<b>Additional Email</b>: tom@mydomain.com<br />
<b>Currently Selected office Agent</b>: The lead did not choose an office agent. <br />
<b>Previously Assigned office Agent</b>: The lead is not assigned to an agent. <br />
<b>Phone Number</b>: (757) 123-4567 <br />
<b>Address</b>: P.O. Box 14123 - Norfolk, VA 12345<br />


Actually only need this in the end
**********************************
First Name: Tom
Last Name: Chambers
Email: tom@mydomain.com
Additional Email: tom@mydomain.com
Currently Selected office Agent: The lead did not choose an office agent.
Previously Assigned office Agent: The lead is not assigned to an agent.
Phone Number: (757) 123-4567
Address: P.O. Box 14123 - Norfolk, VA 12345





Here is what I have been playing with but nothing seems to work.. i have tried a ton of variations but not speaking fluent regular expressions does not help.
**********************************

$str = "<html><b>Tom Chambers</b> (tom@myemail.com ) has signed up for a <a href="http://search.mydomain.com/idx/10159/userSignup.php">Listing Manager Account</a> on <a href="http://virginia-beach-home-for-sale.com/">http://virginia-beach-home-for-sale.com/</a>!<br /><br /><b>First Name</b>: Tom<br />
                                                <b>Last Name</b>: Chambers<br />
<b>Email</b>: tom@mydomain.com<br />
<b>Additional Email</b>: tom@mydomain.com<br /><b>Currently Selected office Agent</b>: The lead did not choose an office agent. <br />
                                                <b>Previously Assigned office Agent</b>: The lead is not assigned to an agent. <br /><b>Phone Number</b>: (757) 123-4567 <br />
<b>Address</b>: P.O. Box 14123 - Norfolk, VA 12345<br /><br /><br />As of this email, the last property detail page that this lead was viewing before signing up can be viewed by clicking <a href="http://search.virginia-beach-home-for-sale.com/idx/10159/details.php?idxID=000&listingID=73MXFC">this link</a><br /><br />Also, the last search that this lead performed can be recreated by clicking <a href="http://search.virginia-beach-home-for-sale.com/idx/10159/results.php?stp=basic&pt=sfr&showField=cityField&lp=200000&hp=800000&ba=0&srt=DESC&start=0&per=10&cid=10159">this link</a></html>";






Here is my crazy expression.. I have tried a thousand different things.
***********************************************************************

preg_match('<br>(.*)<br>:(.*)<br />', $str, $matches);
print "<pre>";
print_r($matches);
print "</pre>";



The output I am hoping for... dream on...
*****************************************

Array
(
    [0] => First Name: Tom
    [1] => Last Name: Chambers
    [2] => Email: tom@mydomain.com
    [3] => Additional Email: tom@mydomain.com
    [4] => Currently Selected office Agent: The lead did not choose an office agent.
    [5] => Previously Assigned office Agent: The lead is not assigned to an agent.
    [6] => Phone Number: (757) 123-4567
    [7] => Address: P.O. Box 14123 - Norfolk, VA 12345

)

Or anything even close would be wonderful....
I just need the two pieces of data associated with each other.

Logged
perkiset
Olde World Hacker
Administrator
Lifer
*****
Offline Offline

Posts: 10096



View Profile
« Reply #1 on: July 09, 2011, 11:39:21 AM »

Note that you have a very convenient layout here that really helps with the regex and with the logical flow of how to do it: the <br /> after each field and the bolding of each caption. That's what we'll use to tackle this.

There are two ways to go. The first is to break things into smaller chunks and tackle each one, rather than trying to write a complicated regex. The second is to do it all with a single preg_match_all.

Here's code that does it both ways:
<?php

$buff = <<<HTML
<html><b>Tom Chambers</b> (tom@myemail.com ) has signed up for a <a href="http://search.mydomain.com/idx/10159/userSignup.php">Listing Manager Account</a> on 
<a href="http://virginia-beach-home-for-sale.com/">http://virginia-beach-home-for-sale.com/</a>!<br /><br /><b>First Name</b>: Tom<br />
<b>Last Name</b>: Chambers<br />
<b>Email</b>: tom@mydomain.com<br />
<b>Additional Email</b>: tom@mydomain.com<br /><b>Currently Selected office Agent</b>: The lead did not choose an office agent. <br />
<b>Previously Assigned office Agent</b>: The lead is not assigned to an agent. <br /><b>Phone Number</b>: (757) 123-4567 <br />
<b>Address</b>: P.O. Box 14123 - Norfolk, VA 12345<br /><br /><br />
As of this email, the last property detail page that this lead was viewing before signing up can be viewed by clicking 
<a href="http://search.virginia-beach-home-for-sale.com/idx/10159/details.php?idxID=000&listingID=73MXFC">this link</a><br />
<br />Also, the last search that this lead performed can be recreated by clicking 
<a href="http://search.virginia-beach-home-for-sale.com/idx/10159/results.php?stp=basic&pt=sfr&showField=cityField&lp=200000&hp=800000&ba=0&srt=DESC&start=0&per=10&cid=10159">this link</a></html>
HTML;

// Walk each line looking for only what we want to keep...
$output = array();
$lines = explode('<br />', $buff);
foreach ($lines as $line)
{
	// $parts[1] is the bold caption, $parts[2] is the value after ": "
	if (preg_match('/<b>([^<]*)<\/b>: (.*)/', $line, $parts))
		$output[$parts[1]] = $parts[2];
}
echo "Here's the tidy output:\n" . print_r($output, true) . "\n\n";

// Execute it all with a single function:
preg_match_all('/<b>([^<]*)<\/b>: (.*)<br \/>/U', $buff, $parts);
echo "Here's the raw multi-output:\n" . print_r($parts, true);

// Clean up the $parts array to look like the first one ...
foreach ($parts[1] as $idx => $name)
	$output[$name] = $parts[2][$idx];

echo "And here's the multi-output, cleaned up like the first one:\n" . print_r($output, true) . "\n\n";

?>

Here is the console output:

Code:
Here's the tidy output:
Array
(
    [First Name] => Tom
    [Last Name] => Chambers
    [Email] => tom@mydomain.com
    [Additional Email] => tom@mydomain.com
    [Currently Selected office Agent] => The lead did not choose an office agent.
    [Previously Assigned office Agent] => The lead is not assigned to an agent.
    [Phone Number] => (757) 123-4567
    [Address] => P.O. Box 14123 - Norfolk, VA 12345
)


Here's the raw multi-output:
Array
(
    [0] => Array
        (
            [0] => <b>First Name</b>: Tom<br />
            [1] => <b>Last Name</b>: Chambers<br />
            [2] => <b>Email</b>: tom@mydomain.com<br />
            [3] => <b>Additional Email</b>: tom@mydomain.com<br />
            [4] => <b>Currently Selected office Agent</b>: The lead did not choose an office agent. <br />
            [5] => <b>Previously Assigned office Agent</b>: The lead is not assigned to an agent. <br />
            [6] => <b>Phone Number</b>: (757) 123-4567 <br />
            [7] => <b>Address</b>: P.O. Box 14123 - Norfolk, VA 12345<br />
        )

    [1] => Array
        (
            [0] => First Name
            [1] => Last Name
            [2] => Email
            [3] => Additional Email
            [4] => Currently Selected office Agent
            [5] => Previously Assigned office Agent
            [6] => Phone Number
            [7] => Address
        )

    [2] => Array
        (
            [0] => Tom
            [1] => Chambers
            [2] => tom@mydomain.com
            [3] => tom@mydomain.com
            [4] => The lead did not choose an office agent.
            [5] => The lead is not assigned to an agent.
            [6] => (757) 123-4567
            [7] => P.O. Box 14123 - Norfolk, VA 12345
        )

)

And here's the multi-output, cleaned up like the first one:
Array
(
    [First Name] => Tom
    [Last Name] => Chambers
    [Email] => tom@mydomain.com
    [Additional Email] => tom@mydomain.com
    [Currently Selected office Agent] => The lead did not choose an office agent.
    [Previously Assigned office Agent] => The lead is not assigned to an agent.
    [Phone Number] => (757) 123-4567
    [Address] => P.O. Box 14123 - Norfolk, VA 12345
)



The first example is easier to understand: I explode the buffer into single lines on <br />, then just have to do a preg_match on each line. In the second, I use preg_match_all to grab all of them at once, but the resulting array is not as ready for further use, which is why I clean it up in the last section.
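As a minimal sketch of how the second approach could be packaged for reuse: the function name parse_lead_fields is hypothetical (it does not appear in the thread), and the trim() calls are an added assumption to drop the stray spaces that precede some of the <br /> tags in the sample email.

```php
<?php
// Hypothetical helper (not from the thread): wraps the preg_match_all
// approach and trims whitespace around each caption and value.
function parse_lead_fields($html)
{
    $fields = array();
    // PREG_SET_ORDER yields one array per match:
    // [0] whole match, [1] caption, [2] value.
    if (preg_match_all('/<b>([^<]*)<\/b>: (.*)<br \/>/U', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[trim($m[1])] = trim($m[2]);
        }
    }
    return $fields;
}

// Two fields taken from the sample email.
$sample = '<b>First Name</b>: Tom<br />' . "\n"
        . '<b>Phone Number</b>: (757) 123-4567 <br />';
print_r(parse_lead_fields($sample));
```

Returning a caption => value array keeps the caller away from the three parallel arrays that preg_match_all produces by default.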

Good luck!
/p
« Last Edit: July 09, 2011, 11:57:45 AM by perkiset » Logged

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
tommytx
« Reply #2 on: July 09, 2011, 11:48:03 AM »

Thanks Perk... you are a genius... how does it feel to be a miracle worker... and that expands my PHP capability for future projects too... learn a little more each day.. but I gotta hurry, there are not too many more days... 66 years old already...
Do you have a donation jar on the front page... you are worth your weight in gold...
I really appreciate what you and Nutballs do for this forum.... hee..hee.. you fixed it before he even got in here today.. but I doubt he will let it pass... he will probably throw in a few more items that teach me even more...

I had over 20 hours in this... and it took you less than an hour... and of that hour you were probably doing something else for 55 minutes of the hour..

Thanks again.
Logged
perkiset
« Reply #3 on: July 09, 2011, 11:51:46 AM »

Quote
Thanks Perk... you are a genius... how does it feel to be a miracle worker... and that expands my PHP capability for future projects too... learn a little more each day.. but I gotta hurry, there are not too many more days... 66 years old already...
It's never too late, Tommy - I started programming over 35 years ago, but I pretty much have to toss everything out every 3 years and learn it all again, because things change so much. We're all pretty much in the same boat ;)

Quote
Do you have a donation jar on the front page... you are worth your weight in gold...
Most kind, but no. I originally put this forum up for this express purpose. I'm where I am because a lot of people have assisted me along the way. It's an easy give-back.

Quote
I had over 20 hours in this... and it took you less than an hour... and of that hour you were probably doing something else for 55 minutes of the hour..
LOL actually, it took almost 10 minutes because I redid it several times for cosmetics :) I wanted it to spur you on and be easy to read.

You have a great one, mate.
/p
Logged

perkiset
« Reply #4 on: July 09, 2011, 11:53:03 AM »

BTW I can walk through the regexes for you if they are confusing.
Logged

tommytx
« Reply #5 on: July 09, 2011, 12:09:38 PM »

That is the neatest thing I have ever seen... the warning... was this your idea or is that a forum selection...
More than once I have been writing an answer while someone else is actually answering the same thing...
Very nice.
Quote
Warning - while you were reading 2 new replies have been posted. You may wish to review your post.

A more detailed explanation would be great... I don't expect you to write reams, but I am sure a lot of us neophytes would benefit from a walk-through... I do spend a lot of time studying code, and of course it's easier with comments.... This is really a good example of a project that anyone could use in their code, as it shows many examples of completing tasks with code...

But I also have another related question to complete the project...
Here is the PHP that is receiving the piped email... I am not sure how to get it into the buffer you set up.
As you can see it's piping into the var $email.... so is this correct....
$buff = "<<<HTML" . $email . "HTML"
I am sure that is not right... but I am trying anyway..


#!/usr/bin/php -q
<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
$email .= fread($fd, 1024);
}
fclose($fd);


// Rest of the  code here..

?>


Logged
perkiset
« Reply #6 on: July 09, 2011, 12:24:36 PM »

Quote
was this your idea or is that a forum selection...
Comes stock with the forum software. Oh, uh, wait - no I thought it up! Yeah that's it! ;)

Quote
More detailed explanation would be great...
Alright here we go:
if (preg_match('/<b>([^<]*)<\/b>: (.*)/', $line, $parts))
First, obviously, we're using the preg_match function, taking input from the variable $line and putting results into the array $parts. Note that preg_match returns 1 if it finds the expression I've specified and 0 if it doesn't (which PHP treats as true and false), so nothing is added to $output unless the line matches.

The regex reads like this:
  • Look for (move forward until) <b>
  • Collect whatever you see that is NOT a < (up until a <)
  • There must be a complete </b>: followed by a space after the last collection
  • Grab whatever you find till the end of the line (by default, . does not match a newline)

The next one is almost exactly the same:
preg_match_all('/<b>([^<]*)<\/b>: (.*)<br \/>/U', $buff, $parts);
... except that I can't use the end of the line as a stopper, so I stop on <br /> and also use the ungreedy modifier (U), which ensures that the regex will ONLY collect up to the next <br /> rather than running on to the last <br /> it can reach.
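To make the effect of the ungreedy modifier concrete, here is a small sketch that runs the same pattern with and without U. The one-line $html string is an assumption built from two fields of the sample email; the variable names are illustrative only.

```php
<?php
// Two fields on a single line, as happens in parts of the sample email.
$html = '<b>Email</b>: tom@mydomain.com<br /><b>Phone Number</b>: (757) 123-4567 <br />';

// Greedy (no U modifier): (.*) runs on to the LAST <br /> it can reach.
preg_match('/<b>([^<]*)<\/b>: (.*)<br \/>/', $html, $greedy);

// Ungreedy (U modifier): (.*) stops at the FIRST <br />.
preg_match('/<b>([^<]*)<\/b>: (.*)<br \/>/U', $html, $lazy);

echo "Greedy value:   ", $greedy[2], "\n";
echo "Ungreedy value: ", $lazy[2], "\n";
```

In the greedy case the captured value swallows the whole Phone Number field as well, which is exactly the problem the U modifier avoids.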


Quote
But I also have another related question to complete the project...
Here is the php that is receiving the piped email... I am not sure how to get it into the buffer you set up.
There's nothing special about it; it's simply a variable called $buff. You could just say $buff = $email, or use $email instead of $buff as the variable you're working on.
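A rough sketch of that idea, putting the read step and the parse step together. Two things here are assumptions rather than anything from the thread: stream_get_contents() is swapped in for the feof/fread loop, and an in-memory stream stands in for php://stdin so the example runs on its own.

```php
<?php
// In the piped script this handle would be fopen('php://stdin', 'r').
// An in-memory stream is used here only so the sketch is self-contained.
$fd = fopen('php://memory', 'w+');
fwrite($fd, "<b>First Name</b>: Tom<br />\n<b>Last Name</b>: Chambers<br />");
rewind($fd);

// stream_get_contents() reads everything that is left on the stream,
// which replaces the while (!feof(...)) fread loop.
$email = stream_get_contents($fd);
fclose($fd);

// No heredoc needed: $email is already a string, so parse it directly.
$output = array();
preg_match_all('/<b>([^<]*)<\/b>: (.*)<br \/>/U', $email, $parts);
foreach ($parts[1] as $idx => $name) {
    $output[$name] = $parts[2][$idx];
}
print_r($output);
```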

I think what's throwing you is the HEREDOC syntax - a handy way of filling up a string variable. These two statements do exactly the same thing:
$aVar = "this is a test\nof the emergency broadcasting system.\nIf this had been an actual emergency...";

$aVar = <<<TEXT
this is a test
of the emergency broadcasting system.
If this had been an actual emergency...
TEXT;

I'm sure the first statement is no mystery to you. The HEREDOC syntax starts with <<< (here comes some text), followed immediately by an arbitrary identifier. I used TEXT in this case; it's up to you, and I pick identifiers that make the code readable, like HTML, XML, CSS, JS, TEXT or SQL. Next comes all the stuff you want in the variable. Finally, on a line ALL BY ITSELF AND WITHOUT ANY WHITESPACE IN FRONT OF IT, comes the identifier again and a semicolon. Note that something like a tab in front of the closing identifier is a syntax error.
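A quick way to check the claim that both forms build the same string is to compare them directly. This is only a sketch: the variable names are illustrative, and the comparison assumes the source file uses Unix (\n) line endings. One caveat for readers on newer versions: PHP 7.3 and later allow the closing identifier to be indented, so the no-whitespace rule above applies to older releases.

```php
<?php
// The double-quoted form, with explicit \n escapes.
$quoted = "this is a test\nof the emergency broadcasting system.";

// The same text as a heredoc. The newline just before the closing
// identifier is not part of the string.
$heredoc = <<<TEXT
this is a test
of the emergency broadcasting system.
TEXT;

// Strict comparison of the two strings.
var_dump($quoted === $heredoc);
```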
« Last Edit: July 09, 2011, 12:31:35 PM by perkiset » Logged

tommytx
« Reply #7 on: July 09, 2011, 12:34:46 PM »

Thanks, the <<< HEREDOC makes sense.. just that I have never used it before...
And I have found a lot of uses for piping an email into a PHP file... if anyone is following along, you can see the code above that pipes the email into PHP by reading standard input.... really neat.... then you can parse the piped-in email the way you see I did above... well, better yet, the way Perkiset did above... I mostly looked on.... I was getting the data using strpos and strstr, and we all know how messy that crap is...


Logged
perkiset
« Reply #8 on: July 09, 2011, 12:36:26 PM »

Perhaps the handiest thing (to me) about HEREDOC is that I can make things look the way I want them to, and don't need to escape quotes. For example:
$SQL = <<<SQL
select
	thisfield,
	thatfield,
	anotherfield
from
	thistable,
	thattable
where
	thisistrue and
	thatistrue and
	not (thisisfalse)
SQL;

or this difference between
$aVar = "<a id=\"thelink\" href=\"/adir/apage.html\" class=\"$classStr\" style=\"stylingstuff\">link text</a>";

$aVar = <<<LINK
<a id="thelink" href="/adir/apage.html" class="$classStr" style="stylingstuff">link text</a>
LINK;

Although there is a tiny amount of additional overhead using HEREDOC, it is often useful in terms of both readability and syntax cleanliness.
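The link example can be checked the same way. The sketch below assumes a made-up value for $classStr and also shows nowdoc (a single-quoted identifier, available since PHP 5.3), which is not mentioned in the thread and is included only to contrast with heredoc's variable interpolation.

```php
<?php
$classStr = 'nav-link';   // hypothetical value, for illustration only

// Double-quoted string: every inner quote must be escaped.
$escaped = "<a id=\"thelink\" href=\"/adir/apage.html\" class=\"$classStr\">link text</a>";

// Heredoc: same result, no escaping, and $classStr is still interpolated.
$heredoc = <<<LINK
<a id="thelink" href="/adir/apage.html" class="$classStr">link text</a>
LINK;

// Nowdoc: behaves like a single-quoted string, so $classStr stays literal.
$nowdoc = <<<'LINK'
<a class="$classStr">link text</a>
LINK;

var_dump($escaped === $heredoc);
echo $nowdoc, "\n";
```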
Logged

perkiset
« Reply #9 on: July 09, 2011, 12:38:32 PM »

Actually Tommy that's a very cool way of getting an email. Personally I use an imap class so that I can manipulate my mailbox, but that's pretty cool quick-and-dirty.

Logged

tommytx
« Reply #10 on: July 09, 2011, 12:51:33 PM »

Oh! Are you a pipe expert? I had a lot of problems stopping the server from sending errors back to the message sender and tried everything with no results... finally my host did something to stop it.. I think he somehow killed the error return, but just for one file, my index.php.. if I try to pipe to any other file, the email sender gets an error..
Code:
#!/usr/bin/php -q
<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
$email .= fread($fd, 1024);
}
fclose($fd);

The -q on the shebang line is supposed to stop errors, but it did not work for me...
So bottom line, to this day I have to use index.php to process, and that is shitty...
For example, I cannot use a file like idx.php; it must be index.php, why I have no idea.. maybe I will ask my host if he is doing anything to suppress index.php errors... I doubt it, but it stopped once I told him.... I have my own VPS server.

So you would not believe how wild what I am doing is... using the one index.php file for many different pipes...
For example:
idx@mydomain.com goes to mypipe/index.php
alert@mydomain.com goes to mypipe/index.php

So how do I process them separately...
for idx the subject line of the message is "Idx_has_a_job_for_you"
for alert the subject line of the message is "Alert_ has_a_job_for_you"

Then the email is checked for those words, and if they are found it uses a PHP include to process that particular job..
I know, what a dumb way to go... but until I find out what is causing the errors I have no choice... and it works well.
But I really need to use a separate file for each pipe.. but cannot, since any other directory or file gives a bounced email.... ain't that crazy.... maybe there is a pipe expert around... is that you.... maybe I should start a pipe thread... hee..hee..
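The subject-line routing described above could look roughly like the sketch below. Everything in it is an assumption for illustration: the pick_handler function, the handler paths, and the exact marker strings are hypothetical and are not the actual code from this setup.

```php
<?php
// Hypothetical dispatcher: map a marker found in the Subject header
// to the script that should handle the message.
function pick_handler($email, array $routes)
{
    // m: ^ and $ match at line boundaries; i: case-insensitive header name.
    if (!preg_match('/^Subject:\s*(.*)$/mi', $email, $m)) {
        return null;
    }
    $subject = trim($m[1]);   // trim() also removes a trailing \r
    foreach ($routes as $marker => $handler) {
        if (stripos($subject, $marker) !== false) {
            return $handler;
        }
    }
    return null;
}

// Assumed markers and handler paths, for illustration only.
$routes = array(
    'Idx_has_a_job_for_you'   => 'handlers/idx.php',
    'Alert_has_a_job_for_you' => 'handlers/alert.php',
);

$email = "From: idx@mydomain.com\r\nSubject: Idx_has_a_job_for_you\r\n\r\nbody";
echo pick_handler($email, $routes), "\n";
```

In a real script the returned path would be passed to include, and a null result would mean the message matched no known job.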

Logged
perkiset
« Reply #11 on: July 09, 2011, 01:05:17 PM »

If you've got your own VPS then I think it's time to switch things up a bit.

Rather than having your ISP hit you with emails, let them sit in your mailbox until you are ready to process them. IMO it is better for you to process them when you want to, rather than whenever your ISP decides to send you an event.

Start here: http://www.php.net/manual/en/book.imap.php#96414

That entire page is a great jumping-off point.

In fact, if you've got your own VPS then you should have control of the website that is driving your leads ... in which case you could do the work of collecting the leads without even DEALING with an email.
Logged

tommytx
« Reply #12 on: July 09, 2011, 01:19:23 PM »

Whoa! Not so fast... I do have my own VPS.. and until last week I had my own Windows VPS too... but I let it go since I really did not like it... it was souped up with all the latest Microsoft servers and stuff... see, I am a member of the Microsoft Action Pack... which means for $299.00 per year I have total access to every single piece of software Microsoft owns... all the servers and all the programming software... it's what they do to try to get us techs to sell their stuff... I have never sold any.. but have used the hell out of it..
But the problem is I design a lot of real estate websites and set up a lot of IDX systems, and many of the IDX providers keep the actual content on the IDX site and not on my client's site... so I have no choice but to use what they allow... now, I can have the emails come to a mailbox and process them in batches.. however, normally we like to process them as they arrive to get an agent on it right away.. But I will look at what you sent and see if I might use it to improve my operations.
I have total hands-on access to the VPS control panel and the mail servers.. but am not too knowledgeable in all that.. but am a quick learner....

Still got the Linux server but got rid of the Windows VPS... I am a PHP lover, and Windows Server is more of a learning curve than I have time for... but I did love the ability to program and run IE and Firefox and have Greasemonkey driving it while I sleep... without my local desktop being turned on... A desktop in the sky is a nice idea..

« Last Edit: July 09, 2011, 01:22:11 PM by tommytx » Logged
tommytx
« Reply #13 on: July 09, 2011, 01:53:10 PM »

Wow! That IMAP stuff looks interesting... I could do a lot of stuff with that but it also looks like I might have to find a more basic book to read before I jump into that with both feet..
Logged
perkiset
« Reply #14 on: July 09, 2011, 03:04:37 PM »

The Windoz stuff will actually make you dumber, rather than smarter and more empowered.

IMO you should get your web stuff on to an Apache server in your own Linux VPS - then you would have 100% control and we could have some much more interesting discussions. A site that collects and delivers leads is really not as complicated as Windows and the system (it sounds like) you have requires.

Perhaps an interesting thing would be for you to start a thread about what you're doing, how you'd like to support your business ventures, what you know and don't know, and let the folks here comment. You might find that the answers are simpler than you assume, and may simply be an exercise in stepping left of the problem, rather than forward into it.
Logged
