nop_90
« Reply #1 on: May 11, 2007, 02:04:59 AM »
The .+? is a non-greedy match, which means it matches only up to the first place where the stop char that follows it can match. In your case the stop char is ". Try making the stop char / instead: "(http://.*?)/ or a variant of that.
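To make the greedy versus non-greedy point concrete, here is a minimal sketch. The sample markup and the example.com / example.org URLs are made up for illustration, and # is used as the pattern delimiter (an assumption of this sketch) so the slashes do not need escaping.

```php
<?php
// Hypothetical sample input; the URLs and markup are made up for illustration.
$html = '<a href="http://www.example.com/page.html">one</a> <a href="http://www.example.org/">two</a>';

// Greedy: .+ runs as far as it can, so the capture ends at the LAST double quote.
preg_match('#"(http://.+)"#', $html, $greedy);

// Non-greedy: .+? stops at the first double quote it reaches, giving one full URL.
preg_match('#"(http://.+?)"#', $html, $lazy);

// Non-greedy with / as the stop char: only the scheme and host remain.
preg_match('#"(http://.+?)/#', $html, $host);

echo $lazy[1], "\n"; // http://www.example.com/page.html
echo $host[1], "\n"; // http://www.example.com
```

The greedy capture in $greedy[1] swallows everything from the first URL through to the end of the second one, which is the behaviour the non-greedy form avoids.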

perkiset
« Reply #2 on: May 11, 2007, 08:58:33 AM »
Yeah... I'd go more like this:
preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);
... tell the regex to go until it finds another forward slash... and the /i modifier at the end will make the HTTP part in front case insensitive.
/p
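A quick sketch of what that call returns, assuming the input is a plain list of URLs. The contents of $inputBuff here are made up for illustration.

```php
<?php
// Hypothetical input: plain URLs, one per line (made up for illustration).
$inputBuff = "http://www.example.com/a/b.html\nHTTP://example.org/index.php?x=1\n";

// [^\/]+ means "one or more characters that are not a forward slash".
preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);

// $outArr[0] holds the full matches, $outArr[1] the first capture group.
foreach ($outArr[1] as $url) {
    echo $url, "\n";
}
```

With this input $outArr[1] holds http://www.example.com and HTTP://example.org. Note that [^\/]+ accepts any character other than a slash, so a host that is not followed by a slash will keep matching through whatever comes next.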

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.

nutballs
« Reply #3 on: May 11, 2007, 09:50:27 AM »
I don't think that will work perk, at least not in all possible situations. If you're being fed URLs, sure. But I'm guessing Cal is using this for more "nefarious" activities.
(https?:\/\/[A-Z0-9.-]+)
Roll that to your own version of regex. This is Windows and might be slightly different.
It matches both http and https, and matches any domain, ended by ANY invalid domain-name character.
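A rough comparison of the two patterns on scraped markup rather than clean URLs. The HTML fragment is made up for illustration, and the /i modifier is added to the character-class pattern (an assumption of this sketch) so that [A-Z0-9.-] also accepts lowercase host names in PHP.

```php
<?php
// Hypothetical scraped markup (made up); the first link has no slash after the host.
$html = '<a href="http://www.example.com">a</a> <a href="https://sub.example.co.uk/x">b</a>';

// "Until the next slash": runs past the closing quote of the first link,
// and does not pick up the https link at all.
preg_match_all('/(http:\/\/[^\/]+)/i', $html, $loose);

// Character class: stops at the first character that cannot be part of a host name.
preg_match_all('/(https?:\/\/[A-Z0-9.-]+)/i', $html, $tight);

echo implode("\n", $tight[1]), "\n";
// http://www.example.com
// https://sub.example.co.uk
```

On this input $loose[1] contains a single entry that still carries the trailing `">a<`, which is the kind of case the character-class version is meant to handle.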

I could eat a bowl of Alphabet Soup and shit a better argument than that.

perkiset
« Reply #4 on: May 11, 2007, 10:11:23 AM »
Thinking you're right NBs... mine assumes a lot... yours is tighter.
Do remember to add the /i at the end to make the http part case-insensitive (at least in PHP).

perkiset
« Reply #5 on: May 11, 2007, 10:12:42 AM »
but hey now... what are you up to? 

nutballs
« Reply #6 on: May 11, 2007, 10:16:12 AM »
lol

Caligula
« Reply #7 on: May 11, 2007, 02:41:45 PM »
Quote from: nutballs
I don't think that will work perk, at least not in all possible situations. If you're being fed URLs, sure. But I'm guessing Cal is using this for more "nefarious" activities.
Nothing nefarious... I am learning PHP/cURL/MySQL and my interest is mainly in spidering/link spamming... Just starting off with small basic tasks and then expanding as I learn how to write the code.
Quote from: perkiset
but hey now... what are you up to?
Sorry Perk... I was just using the domain as an example of what I am trying to do with the regex. Thanks for the help guys, I'm going to try it all now.

perkiset
« Reply #8 on: May 11, 2007, 02:49:32 PM »
uh HUH: That's what I've been telling G/Y/M for years bucko... 

Caligula
« Reply #9 on: May 11, 2007, 06:08:08 PM »
I'm not! I don't have the skill nor the desire to scrape content and regenerate, etc... it's too much work, and with all the work I have with the dozen WH sites I have, I can't imagine the work 100+ sites would be... no thanks. I'd rather just paint my white hats grey... if ya catch my drift. Besides, my first concern is bringing in traffic, which is a major problem for me at this point.
Nevertheless, as for the regex: I want to thank you all for your input. I'd be lost without you guys, especially considering I only ask a question after 1000 Google searches fail. Here is the final regex:
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
It cleans up the URLs real nice and cuts them off perfectly, except for foreign domains, because as you can tell the regex only works for 3-letter domains (.com, .org, .net, etc.). So when I get a www.perkiset.nl, it follows the full URL. But even with that, it still leaves the database looking clean, and as a plus, the spider is running 10 times faster now.
Since I'm still new to this, I did notice one thing: the script runs even faster if I don't echo out the URLs. For example, when I ran it (before the regex change), a 10-second run brought back 330+ URLs and echoed all the URLs and then some to the screen. But when I commented out the echo commands and cleared the DB, the same 10-second run brought back 800+ URLs. I thought it was interesting.
I'm rambling... I am shutting up now. Thanks again guys, I appreciate the help.

thedarkness
« Reply #10 on: May 11, 2007, 06:45:43 PM »
<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]
i.e.
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all( $regex, $target, $matches );
not exhaustively tested
Cheers, td
[edit]Added "context"[/edit]
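A small sketch of how the result of that call can be read. The $target fragment is made up for illustration; the pattern is the one posted above, unchanged.

```php
<?php
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";

// Hypothetical page fragment (made up for illustration).
$target = '<A class="x" HREF="http://www.example.nl/some/page.html">nl</A>'
        . "<a href='https://example.co.uk'>uk</a>";

preg_match_all($regex, $target, $matches);

// $matches[1] holds the scheme of each link, $matches[2] the host name,
// because (.*?) stops at the first quote or slash after the ://.
foreach ($matches[2] as $i => $host) {
    echo $matches[1][$i], ' ', $host, "\n";
}
// http www.example.nl
// https example.co.uk
```

The @ delimiters avoid escaping the slashes, the i modifier covers HREF in capitals, and the s modifier lets . match newlines inside a tag.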

« Last Edit: May 11, 2007, 06:59:57 PM by thedarkness »
"I want to be the guy my dog thinks I am." - Unknown

Caligula
« Reply #11 on: May 11, 2007, 06:59:03 PM »
Quote from: thedarkness
<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]
not exhaustively tested  Cheers, td
Damn dark... that's one hell of an expression... what's that, grab every single link it finds?

thedarkness
« Reply #12 on: May 11, 2007, 07:02:57 PM »
Yep, strips just the FQDN out of just about any anchor tag. Exactly what you are after I believe, works with .com, .org, .net, .co.uk, .nl, .whatever
Cheers, td

Caligula
« Reply #13 on: May 11, 2007, 07:40:09 PM »
Cool.. thanks dark..
One thing I realized about the one I wrote...
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact, right? If it's not exactly 3 letters it should just ignore it, right? But it doesn't, 'cause like I said it picks up 2-letter domains too....

Bompa
« Reply #14 on: May 11, 2007, 07:47:02 PM »
Quote from: Caligula
Cool.. thanks dark..
One thing I realized about the one I wrote...
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact, right? If it's not exactly 3 letters it should just ignore it, right? But it doesn't, 'cause like I said it picks up 2-letter domains too....
If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum. I would try {3,3}
good luck, Bompa
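Worth checking before relying on {3,3}: in PCRE, {3} already means exactly three, so {3,3} should behave the same. The sketch below, using made-up example URLs, suggests the two-letter domains get through for a different reason. The pattern has nothing that forces \.[a-z]{3} to line up with the end of the host, so the lazy .+? simply keeps growing until it finds a later dot followed by three letters, such as .htm in the path.

```php
<?php
// Caligula's pattern, and the same pattern with {3,3} in place of {3}.
$exact = '/(http:\/\/w{3}.+?\.[a-z]{3})/';
$range = '/(http:\/\/w{3}.+?\.[a-z]{3,3})/';

// Hypothetical URLs (made up); the second has a two-letter TLD.
$com = 'http://www.example.com/page.html';
$nl  = 'http://www.example.nl/page.html';

preg_match($exact, $com, $m1); // stops at .com as intended
preg_match($exact, $nl, $m2);  // ".nl/" fails {3}, so .+? runs on to ".htm"
preg_match($range, $nl, $m3);  // {3,3} gives the same result as {3}

echo $m1[1], "\n"; // http://www.example.com
echo $m2[1], "\n"; // http://www.example.nl/page.htm
echo $m3[1], "\n"; // http://www.example.nl/page.htm
```

Restricting what may appear before the final dot, as the character-class and anchor-tag patterns earlier in the thread do, is one way around this.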

"Everything that can be counted does not necessarily count; everything that counts cannot necessarily be counted." -- Albert Einstein