Caligula
here is the regex I am using to pull URLs: '/"(http:\/\/[^0-9].+?)"/' The problem is, the URLs coming back are like:
http://www.perkiset.org
http://www.perkiset.org/forum/index1.php
http://www.perkiset.org/forum/index2.php
...etc...
I want to be able to just grab the URLs like http://www.perkiset.org and nothing more... (no deep links!) but I can't get my regex to just grab the home page.. I have tried everything... it has to be possible to do... right?
nop_90
the .+? is a nongreedy match, that means it matches up till u hit the stop char. in ur case the stop char is the double quote. try making the stop char a forward slash instead: "(http://.*?)/ ... try a variant of that.
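A minimal sketch of that stop-char switch, with an invented anchor string and the slashes escaped for the / delimiter:
<?php
$html = '<a href="http://www.perkiset.org/forum/index1.php">forum</a>';

// non-greedy .+? stops at the next double quote, so the whole path comes back
preg_match('/"(http:\/\/[^0-9].+?)"/', $html, $m);
echo $m[1], "\n"; // http://www.perkiset.org/forum/index1.php

// switch the stop char to a forward slash and the capture ends at the host
preg_match('/"(http:\/\/.*?)\//', $html, $m);
echo $m[1], "\n"; // http://www.perkiset.org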
perkiset
Yeah... I'd go more like this: preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr); ... tell the regex to go until it finds another forward slash... and the /i modifier at the end will make the HTTP part in front case insensitive. /p
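Roughly how that call might look in context; $inputBuff here is just a made-up sample buffer:
<?php
$inputBuff = '... http://www.perkiset.org/forum/index1.php ... HTTP://www.perkiset.org/forum/index2.php ...';

// capture from http:// up to (not including) the next forward slash
preg_match_all('/(http:\/\/[^\/]+)/i', $inputBuff, $outArr);
print_r($outArr[1]);
// Array ( [0] => http://www.perkiset.org [1] => HTTP://www.perkiset.org )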
nutballs
i dont think that will work perk, at least in all possible situations. if you're being fed URLs, sure. but im guessing Cal is using this for more "nefarious" activities. (https?://[A-Z0-9.-]+) roll that into your own version of regex. this is windows and might be slightly different. matches both http and https, matches any domain, ended by ANY invalid domain-name character.
perkiset
thinking you're right NBs... mine assumes a lot... yours is tighter. Do remember to add the /i at the end to make the http part case-insensitive (at least in PHP).
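For reference, a sketch of nutballs' character-class pattern with the /i modifier added; the sample buffer is invented:
<?php
$buff = '<a href="https://www.perkiset.org/forum/index.php">forum</a> plus http://www.perkiset.org/about';

// stops at the first character that is not legal in a domain name (here, "/")
preg_match_all('/(https?:\/\/[A-Z0-9.-]+)/i', $buff, $m);
print_r($m[1]);
// Array ( [0] => https://www.perkiset.org [1] => http://www.perkiset.org )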
perkiset
quote author=Caligula link=topic=196.msg1171#msg1171 date=1178867092
http://www.perkiset.org
http://www.perkiset.org/forum/index1.php
http://www.perkiset.org/forum/index2.php
but hey now... what are you up to?
Caligula
quote author=nutballs link=topic=196.msg1195#msg1195 date=1178902227 i dont think that will work perk, at least in all possible situations. if you're being fed URLs, sure. but im guessing Cal is using this for more "nefarious" activities.
Nothing nefarious... I am learning PHP/cURL/MySQL and my interest is mainly in spidering/link spamming... Just starting off with small basic tasks and then expanding as I learn how to write the code....
quote author=perkiset link=topic=196.msg1198#msg1198 date=1178903562 but hey now... what are you up to?
Sorry Perk... I was just using the domain as an example of what I am trying to do with the regex... Thanks for the help guys.. I'm going to try it all now....
perkiset
uh HUH: That's what I've been telling G/Y/M for years bucko...
Caligula
I'm not! I don't have the skill nor the desire to scrape content and regenerate.. etc... it's too much work, and with all the work I have with the dozen WH sites I have, I can't imagine the work 100+ sites would be... no thanks. I'd rather just paint my white hats grey..... if ya catch my drift... besides.. my first concern is bringing in traffic... which is a major problem for me at this point...
nevertheless... as for the regex... I want to thank you all for your input... I'd be lost without you guys... especially considering I only ask a question after 1000 google searches fail..... here is the final regex: '/(http:\/\/w{3}.+?\.[a-z]{3})/' It cleans up the URLs real nice and cuts them off perfectly.. except for foreign domains... because as you can tell the regex only works for 3 letter domains (.com .org .net ...etc), so when I get a www.perkiset.nl.. it follows the full URL... but even with that.. it still leaves the database looking clean.... and as a plus... the spider is running 10 times faster now.....
Since I'm still new to this... I did notice one thing.... the script runs even faster if I don't echo out the URLs... for example.. when I ran it (before the regex change)... a 10 second run brought back 330+ URLs... and echoed all the URLs and then some to the screen..... but when I commented out the echo commands.... cleared the DB... the same 10 sec run brought back over 800+ URLs.... I thought it was interesting... I'm rambling... I am shutting up now.... Thanks again guys.. I appreciate the help...
thedarkness
<\s*a\s+[^>]*href\s*=\s*["'](http|https|ftp)://(.*?)["'/]
i.e.
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
not exhaustively tested  Cheers, td [edit]Added "context"[/edit]
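A quick, non-exhaustive test of that pattern against a couple of invented anchors:
<?php
$target = '<a href="http://www.perkiset.org/forum/index1.php">forum</a>'
        . " <A HREF='https://example.nl/page'>nl</A>";

// same pattern td posted above; group 2 is the bare FQDN
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
print_r($matches[2]);
// Array ( [0] => www.perkiset.org [1] => example.nl )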
Caligula
quote author=thedarkness link=topic=196.msg1231#msg1231 date=1178934343 <\s*a\s+[^>]*href\s*=\s*["'](http|https|ftp)://(.*?)["'/] not exhaustively tested  Cheers, td
damn dark... thats one hell of an expression... whats that, grab every single link it finds?
thedarkness
Yep, strips just the FQDN out of just about any anchor tag. Exactly what you are after I believe, works with .com, .org, .net, .co.uk, .nl, .whatever Cheers, td
Caligula
Cool.. thanks dark.. One thing I realized about the one I wrote... '/(http:\/\/w{3}.+?\.[a-z]{3})/' According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
Bompa
quote author=Caligula link=topic=196.msg1234#msg1234 date=1178937609 Cool.. thanks dark..
One thing I realized about the one I wrote...
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum. I would try {3,3} good luck, Bompa
Caligula
Thanks Bompa... you're right.... It's pretty much working the way I wanted to now... with the exception of a few stragglers.... maybe I am expecting too much from the script... but at least the DB is not being filled with all the shit URLs now.... All this time and energy on a script that is pretty much useless at this point.. lol, sad isn't it..
thedarkness
quote author=Caligula link=topic=196.msg1234#msg1234 date=1178937609
'/(http:\/\/w{3}.+?\.[a-z]{3})/'
According to the "rules" the {3} part should be exact.. right? If it's not exactly 3 letters it should just ignore it right? But it doesn't, cause like I said it picks up 2 letter domains too....
This is an overuse of "." for what you want. A ".nl" does not satisfy "\.[a-z]{3}", so the regex then looks to see if it can include the ".nl" and the rest of the URL in ".+?" - so if it can find a ".asp" or something further along in the URL, it will match. Examples (I had a file laying around that had some myspace links in it, God knows where that came from):
http://www.myspace.com - ".+?" takes ".myspace" and "\.[a-z]{3}" takes ".com"
http://www.myspace.nl">MySpace.com - ".+?" takes '.myspace.nl">MySpace' and "\.[a-z]{3}" takes ".com"
http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp - ".+?" takes ".myspace.nl/Modules/Help/Pages/HelpCenter" and "\.[a-z]{3}" takes ".asp"
As you can see only the first is what you want. You can get around this by eliminating the "." (match anything) and changing it to "[^.]" (match anything but a literal "."). Oh, just realised that the "." after the www should be an escaped literal "." as well, which is obvious and prolly should have been there originally. So we end up with:
(http://w{3}\.[^.]+?\.[a-z]{3})
Match anything that meets the following criteria:
1: Starts with "http://www"
2: Followed by a literal "."
3: Followed by 1 or more chars that do not include "."
4: Followed by a literal "."
5: Followed by (exactly) three alphabetic chars
quote author=Bompa link=topic=196.msg1235#msg1235 date=1178938022 I would try {3,3}
1: {3} Exactly 3
2: {3,} 3 or more
3: {3,3} Minimum of 3, maximum of 3 (effectively same as 1)
Cheers, td
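A hedged before/after check of that fix, using invented myspace-style strings:
<?php
$nl  = 'http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp';
$com = 'http://www.myspace.com/Modules/Help/Pages/HelpCenter.asp';

// original: ".+?" crawls forward until any ".xxx" satisfies "\.[a-z]{3}",
// so the .nl URL comes back with its whole path attached
preg_match('/(http:\/\/w{3}.+?\.[a-z]{3})/', $nl, $m);
echo $m[1], "\n"; // http://www.myspace.nl/Modules/Help/Pages/HelpCenter.asp

// tightened: "[^.]+?" cannot cross a dot, so the match stops at the TLD
preg_match('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $com, $m);
echo $m[1], "\n"; // http://www.myspace.com

// ...and a two-letter TLD like .nl simply fails to match rather than
// dragging the deep link in
echo preg_match('/(http:\/\/w{3}\.[^.]+?\.[a-z]{3})/', $nl), "\n"; // 0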
thedarkness
quote author=Caligula link=topic=196.msg1237#msg1237 date=1178939925 maybe I am expecting too much from the script...
 TD's law #1 "Always expect too much from the script"  Cheers, td
nutballs
why is this regex getting so complicated? dont you just want any full domainname from HREFs? if all the domainnames you are ever going to pull are preceded by the protocol (http, https, or even ftp), and you dont want anything after the end of the domain, then this is not hard and doesnt require all the extra gesticulations of character counts and such. or did i miss something?
my regex above does exactly that, though not the ftp protocol, but thats a minor addition. since valid domains are only 0-9 a-z . - anything after that which would appear terminates the domain: / ? space , " especially if you're pulling out of HREFs.
there are situations where it can fail, but only in content: "www.domain.com." would capture the domain plus the last period, but that would only occur in content at the end of a sentence, not in HREFs. in content though, you have WAY too many possible issues to write a consistent scraper. no www or protocol preceding the domain is the toughest issue, automatically making it really tough.
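The content edge case nutballs mentions, sketched with a made-up sentence:
<?php
// in prose (not an HREF) a sentence-ending period is a legal domain char,
// so it rides along in the match
$content = 'I read it on http://www.domain.com. Great site.';
preg_match('/(https?:\/\/[A-Z0-9.-]+)/i', $content, $m);
echo $m[1], "\n"; // http://www.domain.com.  (note the trailing dot)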
thedarkness
Of course you're right Nuts. Yours works beautifully and is simple and succinct. My first post was a regex that takes into account a lot of other factors which Caligula is not concerned about. My subsequent posts have been more along the lines of a guide to how regexes work and an example of the thought processes involved in "tightening" one up, rather than trying to provide the "killer" regex. I think each has an application even if only as a teaching aid. I guess we are muddying the waters though, so yes, Calig should be using yours ideally. Lowest common denominator is always preferable but sometimes you just use what works.....  Cheers, td
perkiset
quote author=Bompa link=topic=196.msg1235#msg1235 date=1178938022 If I remember correctly, the curly braces quantifier takes two parameters: minimum and maximum. I would try {3,3}
<aside> Hey Bomps - you're 50% correct - if you curly brace a single number then it means you want *exactly* that count - if you pass two params then it's a {min,max} thang </aside>
grandpa
w00t! it seems i start to understand regex... preg_match_all("/http:\/\/(?:www\.|)([a-z]+\.(?:com|net|org|info))/", $input, $output); works great for me...
Caligula
quote author=thedarkness link=topic=196.msg1238#msg1238 date=1178942929 ...You can get around this by eliminating the "." (match anything) and changing it to "[^.]" (match anything but a literal ".")... So we end up with: (http://w{3}\.[^.]+?\.[a-z]{3})
Dark - it worked Perfect! Your myspace example (next time use Perks domain as the example) is exactly what was getting thru... all fixed now... thanks bro...
Nutballs - I know I'm being a pain in the ass.. but I was trying to get a very specific result... I have plans for scripts that will rely heavily on this particular area... I purposely cut out a lot of things in the expression basically just to see how specific you can get with regex... I try not to ask too many questions... like I said before, if I ask a question about something you can bet I have been trying to work it out for at least 3 days before I break down and ask.... sorry if I am being a bother.....
perkiset
quote author=grandpa link=topic=196.msg1244#msg1244 date=1178947049 it seems i start to understand regex...
preg_match_all("/http:\/\/(?:www\.|)([a-z]+\.(?:com|net|org|info))/", $input, $output);
Nah you suck.  TIPS:
* I'd not put www at the front unless you do not want other subdomains
* I'd add 0-9, period, dash and underscore to the char list [a-z]
* I'd change the last optionals [com/net and such] to [A-Z]{2,4} because of .uk and .info
* I'd add i as a modifier: preg_match_all('/[your regex here]/i', $input, $output) so that it's case insensitive.
But you clearly are getting it. May I suggest Regular Expressions - The Complete Tutorial by Jan Goyvaerts - it's a rocking good tutorial and will get you all the way there. You can order it or buy a PDF... even at 3am. You'll have to google it because I don't remember where his site is... /p
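One possible way to fold those tips back into grandpa's pattern; a sketch under those assumptions, not the canonical fix, with an invented $input:
<?php
$input = 'see http://www.perkiset.org/forum/index.php and http://Sub.Example.co.uk/page';

// subdomains allowed, digits/dot/dash/underscore in the name,
// 2-4 letter TLD, case-insensitive
preg_match_all('/http:\/\/([a-z0-9._-]+\.[a-z]{2,4})/i', $input, $output);
print_r($output[1]);
// Array ( [0] => www.perkiset.org [1] => Sub.Example.co.uk )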
perkiset
quote author=Caligula link=topic=196.msg1245#msg1245 date=1178947308 I try not to ask too many questions... like I said before if I ask a question about something you can bet I have been trying to work it out for at least 3 days before I break down and ask.... sorry if I am being a bother.....
... easy on the caffeine there pal - that's why the Cache is here. Go make lots of cash and make us all look bad. /p
grandpa
quote author=perkiset link=topic=196.msg1246#msg1246 date=1178947511 ...May I suggest Regular Expressions - The Complete Tutorial by Jan Goyvaerts - it's a rocking good tutorial and will get you all the way there...
thx perk, would you like to send me a copy 
thx Caligula, it is this revolution thread that made me jump to the next step of understanding regex, after trying to solve your need. maybe i would never understand regex if this thread didnt exist 
one question perk, i see out there, there are some modifiers: "#[regex]#" "/[regex]/" "@[regex]@" what's the difference?
nutballs
quote author=Caligula link=topic=196.msg1245#msg1245 date=1178947308 Nutballs - I know I'm being a pain in the ass.. but I was trying to get a very specific result... I purposely cut out a lot of things in the expression basically just to see how specific you can get with regex... sorry if I am being a bother.....
bah, dont be such a pansy, you're not being a pain, you're trying to learn. apparently there is more to your needs than what my regex covers; i was just trying to understand if there was more to your question or not. I thought you just meant grabbing domains from links (which was my target with that regex). If you're doing more than that, it does get more complicated, and possibly dicey. but of course, the point here is to learn and share, nothing was meant by my post. i didnt realize it took a turn towards the "general conversation" about regex. just was trying to see the whole picture. and there are no dumb questions, just people who are too dumb to ask them.
thedarkness
quote author=grandpa link=topic=196.msg1248#msg1248 date=1178948652
one question perk, i see out there, there are some modifiers: "#[regex]#" "/[regex]/" "@[regex]@"
what's the difference?
None really. The # or / or @ are just the character you choose (arbitrary) to be the "border" of your regex. It separates the regex from the modifiers. I think the term is delimiter? Cheers, td [edit]I know what a delimiter is, of course, just wasn't sure whether it was the correct term in this case, but I looked it up... and it is... so there ya go....[/edit]
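A small illustration of the delimiter point; the three calls below are assumed equivalent:
<?php
$url = 'http://www.perkiset.org/forum/';

// with "/" as the delimiter, literal slashes must be escaped...
preg_match('/http:\/\/[^\/]+/', $url, $a);
// ...with "#" or "@" they can be left alone
preg_match('#http://[^/]+#', $url, $b);
preg_match('@http://[^/]+@', $url, $c);

echo $a[0], "\n", $b[0], "\n", $c[0], "\n"; // http://www.perkiset.org, three times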
perkiset
I think you're right TD, although I always see PHP regex surrounded by '/', as well as in javascript... but just cause you posted that I gotta go give it a try in both...
thedarkness
I always use "@" myself as I find I have to escape "/" too often. e.g.
$regex = "@<\s*a\s+[^>]*href\s*=\s*[\"'](http|https|ftp)://(.*?)[\"'/]@is";
preg_match_all($regex, $target, $matches);
Cheers, td
perkiset
That's fishing great. FishING great. Didn't know that lol... I have to escape the hell out of / as well. Pisses me off. Awesome! Cheers backatcha /p
Caligula
http://www.perkiset.org/forum/php/quick_dirty_url_spider-t206.0.html