Caligula
here is the regex I am using to pull URLs: '/"(http:\/\/[^0-9].+?)"/' The problem is, the URLs coming back are like:
http://www.perkiset.org
http://www.perkiset.org/forum/index1.php
http://www.perkiset.org/forum/index2.php
...etc...
I want to be able to just grab the URLs like http://www.perkiset.org and nothing more... (no deep links!) but I can't get my regex to just grab the home page.. I have tried everything... it has to be possible to do... right?
nop_90
the .+? is a nongreedy match, that means it matches up till u hit the stop char. in ur case the stop char is the double quote. try making the stop char a forward slash instead: "(http://.*?)/ ... try a variant of that.
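A minimal sketch of that stop-char switch, with an invented anchor string and the slashes escaped for the / delimiter:
<?php
$html = '<a href="http://www.perkiset.org/forum/index1.php">forum</a>';

// non-greedy .+? stops at the next double quote, so the whole path comes back
preg_match('/"(http:\/\/[^0-9].+?)"/', $html, $m);
echo $m[1], "\n"; // http://www.perkiset.org/forum/index1.php

// switch the stop char to a forward slash and the capture ends at the host
preg_match('/"(http:\/\/.*?)\//', $html, $m);
echo $m[1], "\n"; // http://www.perkiset.org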
perkiset
Yeah... I'd go more like this: preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr); ... tell the regex to go until it finds another forward slash... and the /i modifier at the end will make the HTTP part in front case insensitive. /p
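Roughly how that call might look in context; $inputBuff here is just a made-up sample buffer:
<?php
$inputBuff = '... http://www.perkiset.org/forum/index1.php ... HTTP://www.perkiset.org/forum/index2.php ...';

// capture from http:// up to (not including) the next forward slash
preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);
print_r($outArr[1]);
// Array ( [0] => http://www.perkiset.org [1] => HTTP://www.perkiset.org )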
nutballs
i dont think that will work perk, at least in all possible situations. if you're being fed URLs, sure. but im guessing Cal is using this for more "nefarious" activities. (https?://[A-Z0-9.-]+) roll that into your own version of regex. this is windows and might be slightly different. matches both http and https, matches any domain, ended by ANY invalid domain-name character.
perkiset
thinking you're right NBs... mine assumes a lot... yours is tighter. Do remember to add the /i at the end to make the http part case-insensitive (at least in PHP).
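For reference, a sketch of nutballs' character-class pattern with the /i modifier added; the sample buffer is invented:
<?php
$buff = '<a href="https://www.perkiset.org/forum/index.php">forum</a> plus http://www.perkiset.org/about';

// stops at the first character that is not legal in a domain name (here, "/")
preg_match_all('/(https?:\/\/[A-Z0-9.-]+)/i', $buff, $m);
print_r($m[1]);
// Array ( [0] => https://www.perkiset.org [1] => http://www.perkiset.org )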
perkiset
quote author=Caligula link=topic=196.msg1171#msg1171 date=1178867092
http://www.perkiset.org
http://www.perkiset.org/forum/index1.php
http://www.perkiset.org/forum/index2.php
but hey now... what are you up to?
Caligula
quote author=nutballs link=topic=196.msg1195#msg1195 date=1178902227 i dont think that will work perk, at least in all possible situations. if you're being fed URLs, sure. but im guessing Cal is using this for more "nefarious" activities.
Nothing nefarious... I am learning PHP/cURL/MySQL and my interest is mainly in spidering/link spamming... Just starting off with small basic tasks and then expanding as I learn how to write the code....
quote author=perkiset link=topic=196.msg1198#msg1198 date=1178903562 but hey now... what are you up to?
Sorry Perk... I was just using the domain as an example of what I am trying to do with the regex... Thanks for the help guys.. I'm going to try it all now....
perkiset
uh HUH: That's what I've been telling G/Y/M for years bucko...
Caligula
I'm not! I don't have the skill nor the desire to scrape content and regenerate.. etc... it's too much work, and with all the work I have with the dozen WH sites I have, I can't imagine the work 100+ sites would be... no thanks. I'd rather just paint my white hats grey..... if ya catch my drift... besides.. my first concern is bringing in traffic... which is a major problem for me at this point...
nevertheless... as for the regex... I want to thank you all for your input... I'd be lost without you guys... especially considering I only ask a question after 1000 google searches fail..... here is the final regex: '/(http:\/\/w{3}.+?\.[a-z]{3})/' It cleans up the URLs real nice and cuts them off perfectly.. except for foreign domains... because as you can tell the regex only works for 3 letter domains (.com .org .net ...etc), so when I get a www.perkiset.nl.. it follows the full URL... but even with that.. it still leaves the database looking clean.... and as a plus... the spider is running 10 times faster now.....
Since I'm still new to this... I did notice one thing.... the script runs even faster if I don't echo out the URLs... for example.. when I ran it (before the regex change)... a 10 second run brought back 330+ URLs... and echoed all the URLs and then some to the screen..... but when I commented out the echo commands.... cleared the DB... the same 10 sec run brought back over 800+ URLs.... I thought it was interesting... I'm rambling... I am shutting up now.... Thanks again guys.. I appreciate the help...
thedarkness
<\s*a\s+[^>]*href\s*=\s*["'](http|https|ftp)://(.*?)["'/]
i.e.
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
not exhaustively tested  Cheers, td [edit]Added "context"[/edit]
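A quick, non-exhaustive test of that pattern against a couple of invented anchors:
<?php
$target = '<a href="http://www.perkiset.org/forum/index1.php">forum</a>'
        . " <A HREF='https://example.nl/page'>nl</A>";

// same pattern td posted above; group 2 is the bare FQDN
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
print_r($matches[2]);
// Array ( [0] => www.perkiset.org [1] => example.nl )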
Caligula
quote author=thedarkness link=topic=196.msg1231#msg1231 date=1178934343 <\s*a\s+[^>]*href\s*=\s*["'](http|https|ftp)://(.*?)["'/] not exhaustively tested  Cheers, td
damn dark... thats one hell of an expression... whats that, grab every single link it finds?
thedarkness
Yep, strips just the FQDN out of just about any anchor tag. Exactly what you are after I believe, works with .com, .org, .net, .co.uk, .nl, .whatever Cheers, td
Caligula
Cool.. thanks dark.. One thing I realized about the one I wrote... '/(http:\/\/w{3}.+?\.[a-z]{3})/' According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
Bompa
quote author=Caligula link=topic=196.msg1234#msg1234 date=1178937609 Cool.. thanks dark..
One thing I realized about the one I wrote...
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum. I would try {3,3} good luck, Bompa
Caligula
Thanks Bompa... you're right.... It's pretty much working the way I wanted to now... with the exception of a few stragglers.... maybe I am expecting too much from the script... but at least the DB is not being filled with all the shit URLs now.... All this time and energy on a script that is pretty much useless at this point.. lol, sad isn't it..
thedarkness
quote author=Caligula link=topic=196.msg1234#msg1234 date=1178937609
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
This is an overuse of "." for what you want. A ".nl" does not satisfy "\.[a-z]{3}", so the regex then looks to see if it can include the ".nl" and the rest of the URL in ".+?" - so if it can find a ".asp" or something further along in the URL, it will match. Examples (I had a file laying around that had some myspace links in it, God knows where that came from):
http://www.myspace.com - ".+?" takes ".myspace" and "\.[a-z]{3}" takes ".com"
http://www.myspace.nl">MySpace.com - ".+?" takes '.myspace.nl">MySpace' and "\.[a-z]{3}" takes ".com"
http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp - ".+?" takes ".myspace.nl/Modules/Help/Pages/HelpCenter" and "\.[a-z]{3}" takes ".asp"
As you can see only the first is what you want. You can get around this by eliminating the "." (match anything) and changing it to "[^.]" (match anything but a literal "."). Oh, just realised that the "." after the www should be an escaped literal "." as well, which is obvious and prolly should have been there originally. So we end up with:
(http://w{3}\.[^.]+?\.[a-z]{3})
Match anything that meets the following criteria:
1: Starts with "http://www"
2: Followed by a literal "."
3: Followed by 1 or more chars that do not include "."
4: Followed by a literal "."
5: Followed by (exactly) three alphabetic chars
quote author=Bompa link=topic=196.msg1235#msg1235 date=1178938022 I would try {3,3}
1: {3} Exactly 3
2: {3,} 3 or more
3: {3,3} Minimum of 3, maximum of 3 (effectively same as 1)
Cheers, td
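A hedged before/after check of that fix, using invented myspace-style strings:
<?php
$nl  = 'http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp';
$com = 'http://www.myspace.com/Modules/Help/Pages/HelpCenter.asp';

// original: ".+?" crawls forward until any ".xxx" satisfies "\.[a-z]{3}",
// so the .nl URL comes back with its whole path attached
preg_match('/(http:\/\/w{3}.+?\.[a-z]{3})/', $nl, $m);
echo $m[1], "\n"; // http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp

// tightened: "[^.]+?" cannot cross a dot, so the match stops at the TLD
preg_match('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $com, $m);
echo $m[1], "\n"; // http://www.myspace.com

// ...and a two-letter TLD like .nl simply fails to match rather than
// dragging the deep link in
echo preg_match('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $nl), "\n"; // 0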
thedarkness
quote author=Caligula link=topic=196.msg1237#msg1237 date=1178939925 maybe I am expecting too much from the script...
 TD's law #1 "Always expect too much from the script"  Cheers, td
nutballs
why is this regex getting so complicated? dont you just want any full domainname from HREFs? if all the domainnames you are ever going to pull are preceded by the protocol (http, https, or even ftp), and you dont want anything after the end of the domain, then this is not hard and doesnt require all the extra gesticulations of character counts and such. or did i miss something?
my regex above does exactly that, though not the ftp protocol, but thats a minor addition. since valid domains are only 0-9 a-z . - anything after that which would appear terminates the domain: / ? space , " especially if you're pulling out of HREFs.
there are situations where it can fail, but only in content: "www.domain.com." would capture the domain plus the last period, but that would only occur in content at the end of a sentence, not in HREFs. in content though, you have WAY too many possible issues to write a consistent scraper. no www or protocol preceding the domain is the toughest issue, automatically making it really tough.
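The content edge case nutballs mentions, sketched with a made-up sentence:
<?php
// in prose (not an HREF) a sentence-ending period is a legal domain char,
// so it rides along in the match
$content = 'I read it on http://www.domain.com. Great site.';
preg_match('/(https?:\/\/[A-Z0-9.-]+)/i', $content, $m);
echo $m[1], "\n"; // http://www.domain.com.  (note the trailing dot)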
thedarkness
Of course you're right Nuts. Yours works beautifully and is simple and succinct. My first post was a regex that takes into account a lot of other factors which Caligula is not concerned about. My subsequent posts have been more along the lines of a guide to how regexes work and an example of the thought processes involved in "tightening" one up, rather than trying to provide the "killer" regex. I think each has an application even if only as a teaching aid. I guess we are muddying the waters though, so yes, Calig should be using yours ideally. Lowest common denominator is always preferable but sometimes you just use what works.....  Cheers, td
perkiset
quote author=Bompa link=topic=196.msg1235#msg1235 date=1178938022 If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum. I would try {3,3}
<aside> Hey Bomps - you're 50% correct - if you curly brace a single number then it means you want *exactly* that count - if you pass two params then it's a {min,max} thang </aside>
grandpa
w00t! it seems i start to understand regex... preg_match_all("/http:\/\/(?:www\.|)([a-z]+\.(?:com|net|org|info))/", $input, $output); works great for me...
Caligula
quote author=thedarkness link=topic=196.msg1238#msg1238 date=1178942929 ...You can get around this by eliminating the "." (match anything) and changing it to "[^.]" (match anything but a literal ".")... So we end up with: (http://w{3}\.[^.]+?\.[a-z]{3})
Dark - it worked Perfect! Your myspace example (next time use Perks domain as the example) is exactly what was getting thru... all fixed now... thanks bro...
Nutballs - I know I'm being a pain in the ass.. but I was trying to get a very specific result... I have plans for scripts that will rely heavily on this particular area... I purposely cut out a lot of things in the expression basically just to see how specific you can get with regex... I try not to ask too many questions... like I said before, if I ask a question about something you can bet I have been trying to work it out for at least 3 days before I break down and ask.... sorry if I am being a bother.....
perkiset
quote author=grandpa link=topic=196.msg1244#msg1244 date=1178947049 it seems i start to understand regex...
preg_match_all("/http:\/\/(?:www\.|)([a-z]+\.(?:com|net|org|info))/", $input, $output);
Nah you suck.  TIPS:
* I'd not put www at the front unless you do not want other subdomains
* I'd add 0-9, period, dash and underscore to the char list [a-z]
* I'd change the last optionals [com/net and such] to [A-Z]{2,4} because of .uk and .info
* I'd add i as a modifier: preg_match_all('/[your regex here]/i', $input, $output) so that it's case insensitive.
But you clearly are getting it. May I suggest Regular Expressions - The Complete Tutorial by Jan Goyvaerts - it's a rocking good tutorial and will get you all the way there. You can order it or buy a PDF... even at 3am. You'll have to google it because I don't remember where his site is... /p
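One possible way to fold those tips back into grandpa's pattern; a sketch under those assumptions, not the canonical fix, with an invented $input:
<?php
$input = 'see http://www.perkiset.org/forum/index.php and http://Sub.Example.co.uk/page';

// subdomains allowed, digits/dot/dash/underscore in the name,
// 2-4 letter TLD, case-insensitive
preg_match_all('/http:\/\/([a-z0-9._-]+\.[a-z]{2,4})/i', $input, $output);
print_r($output[1]);
// Array ( [0] => www.perkiset.org [1] => Sub.Example.co.uk )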
perkiset
quote author=Caligula link=topic=196.msg1245#msg1245 date=1178947308 I try not to ask too many questions... like I said before if I ask a question about something you can bet I have been trying to work it out for at least 3 days before I break down and ask.... sorry if I am being a bother.....
... easy on the caffeine there pal - that's why the Cache is here. Go make lots of cash and make us all look bad. /p
grandpa
quote author=perkiset link=topic=196.msg1246#msg1246 date=1178947511 ...May I suggest Regular Expressions - The Complete Tutorial by Jan Goyvaerts - it's a rocking good tutorial and will get you all the way there...
thx perk, would you like to send me a copy 
thx Caligula, it is this revolution thread that made me jump to the next step of understanding regex, after trying to solve your need. maybe i would never understand regex if this thread didnt exist 
one question perk, i see out there, there are some modifiers: "#[regex]#" "/[regex]/" "@[regex]@" what's the difference?
nutballs
quote author=Caligula link=topic=196.msg1245#msg1245 date=1178947308 Nutballs - I know I'm being a pain in the ass.. but I was trying to get a very specific result... I purposely cut out a lot of things in the expression basically just to see how specific you can get with regex... sorry if I am being a bother.....
bah, dont be such a pansy, you're not being a pain, you're trying to learn. apparently there is more to your needs than what my regex covers; i was just trying to understand if there was more to your question or not. I thought you just meant grabbing domains from links (which was my target with that regex). If you're doing more than that, it does get more complicated, and possibly dicey. but of course, the point here is to learn and share, nothing was meant by my post. i didnt realize it took a turn towards the "general conversation" about regex. just was trying to see the whole picture. and there are no dumb questions, just people who are too dumb to ask them.
thedarkness
quote author=grandpa link=topic=196.msg1248#msg1248 date=1178948652
one question perk, i see out there, there are some modifiers: "#[regex]#" "/[regex]/" "@[regex]@"
what's the difference?
None really. The # or / or @ are just the character you choose (arbitrary) to be the "border" of your regex. It separates the regex from the modifiers. I think the term is delimiter? Cheers, td [edit]I know what a delimiter is, of course, just wasn't sure whether it was the correct term in this case, but I looked it up... and it is... so there ya go....[/edit]
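A small illustration of the delimiter point; the three calls below are assumed equivalent:
<?php
$url = 'http://www.perkiset.org/forum/';

// with "/" as the delimiter, literal slashes must be escaped...
preg_match('/http:\/\/[^\/]+/', $url, $a);
// ...with "#" or "@" they can be left alone
preg_match('#http://[^/]+#', $url, $b);
preg_match('@http://[^/]+@', $url, $c);

echo $a[0], "\n", $b[0], "\n", $c[0], "\n"; // http://www.perkiset.org, three times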
perkiset
I think you're right TD, although I always see PHP regex surrounded by '/', as well as in javascript... but just cause you posted that I gotta go give it a try in both...
thedarkness
I always use "@" myself as I find I have to escape "/" too often. e.g.
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
Cheers, td
perkiset
That's fishing great. FishING great. Didn't know that lol... I have to escape the hell out of / as well. Pisses me off. Awesome! Cheers backatcha /p
Caligula
http://www.perkiset.org/forum/php/quick_dirty_url_spider-t206.0.html