The Cache: Technology Expert's Forum
 
Author Topic: Grabbing Neat URLs With Regex  (Read 13174 times)
Caligula (Rookie; Posts: 39)
« on: May 11, 2007, 12:04:52 AM »

Here is the regex I am using to pull URLs:

'/"(http:\/\/[^0-9].+?)"/'


The problem is, the URLs coming back are like:

http://www.perkiset.org
http://www.perkiset.org/forum/index1.php
http://www.perkiset.org/forum/index2.php

...etc...

I want to be able to just grab the URLs like:

http://www.perkiset.org

and nothing more (no deep links!), but I can't get my regex to just grab the home page. I have tried everything... it has to be possible to do, right?
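For reference, this is roughly how I'm calling it (simplified sketch; the fetch is just a placeholder, not the real spider code):

Code:
// roughly the current setup (simplified; the fetch is just a placeholder)
$html = file_get_contents('http://www.perkiset.org/');

preg_match_all('/"(http:\/\/[^0-9].+?)"/', $html, $matches);

print_r($matches[1]); // comes back with the deep links too, e.g. /forum/index1.php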

nop_90 (Global Moderator, Lifer; Posts: 2203)
« Reply #1 on: May 11, 2007, 02:04:59 AM »

The .+? is a non-greedy match; that means it matches up until it hits the stop char. In your case the stop char is ". Try making the stop char / instead:

"(http://.*?)/

Try a variant of that.
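Something like this, maybe (rough, untested sketch; $html is whatever page you pulled down):

Code:
// rough sketch of the stop-char idea; the ~ delimiter avoids escaping the slashes
preg_match_all('~"(http://.*?)/~', $html, $matches);

print_r($matches[1]); // e.g. http://www.perkiset.org
// caveat: a quoted URL with no slash after the domain won't match this one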

perkiset (Olde World Hacker, Administrator, Lifer; Posts: 10096)
« Reply #2 on: May 11, 2007, 08:58:33 AM »

Yeah... I'd go more like this:

preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);

... tell the regex to go until it finds another forward slash... and the /i modifier at the end will make the HTTP part in front case insensitive.
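Something like this, for instance (made-up buffer, just to show what lands in $outArr):

Code:
// made-up input buffer, purely for illustration
$inputBuff = '<a href="http://www.perkiset.org/forum/index1.php">one</a>
              <a href="HTTP://www.perkiset.org/forum/index2.php">two</a>';

preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);

print_r($outArr[1]);
// [0] => http://www.perkiset.org
// [1] => HTTP://www.perkiset.org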

/p

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs (Administrator, Lifer; Posts: 5627)
« Reply #3 on: May 11, 2007, 09:50:27 AM »

I don't think that will work, Perk, at least not in all possible situations. If you're being fed URLs, sure, but I'm guessing Cal is using this for more "nefarious" activities.

(https?:\/\/[A-Z0-9.-]+)

Roll that into your own flavor of regex; this is on Windows and might be slightly different.

Matches both http and https.
Matches any domain, ending at the first character that isn't valid in a domain name.
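Dropped into PHP, that would be something like this (sketch; $html is whatever you scraped, and the class is uppercase-only as written):

Code:
// nutballs' class dropped into PCRE more or less verbatim (uppercase-only as written)
preg_match_all('/(https?:\/\/[A-Z0-9.-]+)/', $html, $m);

print_r(array_unique($m[1]));
// a lowercase href like http://www.perkiset.org won't match until /i is added (next reply)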

I could eat a bowl of Alphabet Soup and shit a better argument than that.
perkiset (Olde World Hacker, Administrator, Lifer; Posts: 10096)
« Reply #4 on: May 11, 2007, 10:11:23 AM »

Thinking you're right, NBs... mine assumes a lot; yours is tighter.

Do remember to add the /i at the end to make it case-insensitive (at least in PHP); otherwise that [A-Z0-9.-] class only matches uppercase.
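i.e. something like this (same sketch as above, with the modifier added):

Code:
// same pattern with /i so lowercase domains (the usual case) match too
preg_match_all('/(https?:\/\/[A-Z0-9.-]+)/i', $html, $m);
// now http://www.perkiset.org and HTTPS://PERKISET.ORG both come back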
perkiset (Olde World Hacker, Administrator, Lifer; Posts: 10096)
« Reply #5 on: May 11, 2007, 10:12:42 AM »


but hey now... what are you up to? Don't make me...  ROFLMAO
nutballs (Administrator, Lifer; Posts: 5627)
« Reply #6 on: May 11, 2007, 10:16:12 AM »

lol
Caligula (Rookie; Posts: 39)
« Reply #7 on: May 11, 2007, 02:41:45 PM »

Quote from: nutballs on May 11, 2007, 09:50:27 AM
I don't think that will work, Perk, at least not in all possible situations. If you're being fed URLs, sure, but I'm guessing Cal is using this for more "nefarious" activities.

Nothing nefarious... I am learning PHP/cURL/MySQL and my interest is mainly in spidering/link spamming... Just starting off with small basic tasks and then expanding as I learn how to write the code....


ROFLMAO Sorry Perk... I was just using the domain as an example of what I am trying to do with the regex... 

Thanks for the help guys.. I'm going to try it all now....

perkiset (Olde World Hacker, Administrator, Lifer; Posts: 10096)
« Reply #8 on: May 11, 2007, 02:49:32 PM »

uh HUH: That's what I've been telling G/Y/M for years bucko...  Don't make me...

 Wink
Caligula (Rookie; Posts: 39)
« Reply #9 on: May 11, 2007, 06:08:08 PM »

I'm not!  ROFLMAO I don't have the skill nor the desire to scrape content and regenerate, etc. It's too much work, and with all the work I have with the dozen WH sites I've got, I can't imagine the work 100+ sites would be... no thanks. I'd rather just paint my white hats grey... if ya catch my drift. Besides, my first concern is bringing in traffic, which is a major problem for me at this point.

Anyway, as for the regex: I want to thank you all for your input. I'd be lost without you guys, especially considering I only ask a question after 1000 Google searches fail. Here is the final regex:

'/(http:\/\/w{3}.+?\.[a-z]{3})/'

It cleans up the URLs real nice and cuts them off perfectly, except for foreign domains: as you can tell, the regex only works for 3-letter TLDs (.com, .org, .net, etc.), so when I get a www.perkiset.nl it follows the full URL. But even with that, it still leaves the database looking clean... and as a plus, the spider is running 10 times faster now.
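To show what I mean (quick test sketch, not the actual spider code):

Code:
// quick illustration of the 3-letter-TLD behavior (test sketch only)
$re = '/(http:\/\/w{3}.+?\.[a-z]{3})/';

preg_match($re, 'http://www.perkiset.org/forum/index1.php', $m);
echo $m[1] . "\n"; // http://www.perkiset.org  (cut off right after the 3-letter TLD)

preg_match($re, 'http://www.perkiset.nl/forum/index.php', $m);
echo $m[1] . "\n"; // http://www.perkiset.nl/forum/index.php  (runs on until it hits .php)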

Since I'm still new to this, I did notice one thing: the script runs even faster if I don't echo out the URLs. For example, when I ran it (before the regex change), a 10-second run brought back 330+ URLs and echoed all the URLs and then some to the screen. But when I commented out the echo commands and cleared the DB, the same 10-second run brought back over 800 URLs.

I thought it was interesting... I'm rambling, I am shutting up now. Thanks again guys, I appreciate the help.  Mobster



 
thedarkness (Lifer; Posts: 585)
« Reply #10 on: May 11, 2007, 06:45:43 PM »

<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]

i.e.
Code:
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";

preg_match_all( $regex, $target, $matches );
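The domains then come out in $matches[2] (group 1 is just the scheme), so something like this to use them (quick sketch):

Code:
// group 2 of each match is the bare domain; group 1 is the scheme
$domains = array_unique( $matches[2] );

foreach ( $domains as $d ) {
    echo $d . "\n"; // e.g. www.perkiset.org
}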


not exhaustively tested  ROFLMAO

Cheers,
td

[edit]Added "context"[/edit]
« Last Edit: May 11, 2007, 06:59:57 PM by thedarkness »

"I want to be the guy my dog thinks I am."
 - Unknown
Caligula (Rookie; Posts: 39)
« Reply #11 on: May 11, 2007, 06:59:03 PM »

Quote from: thedarkness on May 11, 2007, 06:45:43 PM
<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]

not exhaustively tested  ROFLMAO


Damn, dark... that's one hell of an expression. What does it grab, every single link it finds?
thedarkness (Lifer; Posts: 585)
« Reply #12 on: May 11, 2007, 07:02:57 PM »

Yep, strips just the FQDN out of just about any anchor tag. Exactly what you're after, I believe; works with .com, .org, .net, .co.uk, .nl, .whatever.

Cheers,
td
Caligula (Rookie; Posts: 39)
« Reply #13 on: May 11, 2007, 07:40:09 PM »

Cool.. thanks dark..

One thing I realized about the one I wrote...

'/(http:\/\/w{3}.+?\.[a-z]{3})/'

According to the "rules", the {3} part should be exact, right? If it's not exactly 3 letters, it should just ignore it, right? But it doesn't, 'cause like I said it picks up 2-letter domains too...

Bompa (Administrator, Lifer; Posts: 564)
« Reply #14 on: May 11, 2007, 07:47:02 PM »

Quote from: Caligula on May 11, 2007, 07:40:09 PM
'/(http:\/\/w{3}.+?\.[a-z]{3})/'

According to the "rules", the {3} part should be exact, right? If it's not exactly 3 letters, it should just ignore it, right? But it doesn't, 'cause like I said it picks up 2-letter domains too...



If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum.

I would try {3,3}


good luck,
Bompa
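For what it's worth, a quick test along these lines (illustrative sketch) suggests {3} and {3,3} behave the same in PCRE, and that the 2-letter pickup comes from the non-greedy .+? running past the short TLD until it can end on a later dot plus 3 letters (like .php):

Code:
// illustrative test: {3} vs {3,3} against a 2-letter TLD URL
$url = 'http://www.perkiset.nl/forum/index.php';

preg_match('/(http:\/\/w{3}.+?\.[a-z]{3})/', $url, $m);
echo $m[1] . "\n"; // http://www.perkiset.nl/forum/index.php

preg_match('/(http:\/\/w{3}.+?\.[a-z]{3,3})/', $url, $m);
echo $m[1] . "\n"; // same result either way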

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein