
jubegnx
hey all,
i made this little article scraper that basically takes a number of articles and filters out only the body of the article. now sometimes it returns some really weird characters within the article, like â. is there a perl module or function that will filter out only plain text?

thanks

dirk
You could have a look at HTML::Entities (encode or decode strings with HTML entities). We used it to get rid of such weird characters.

jubegnx
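A note on the HTML::Entities suggestion above: here is a minimal sketch of what decoding does, assuming the CPAN module HTML::Entities is installed (it is not part of core Perl). The helper name decode_text and the sample string are made up for illustration. Decoding turns each entity into the character it stands for, so on its own it does not strip non-ASCII characters, which may be why it removed only some of the odd characters.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return a copy of $html with HTML entities decoded.
sub decode_text {
    my ($html) = @_;
    return decode_entities($html);   # returns the decoded copy
}

my $decoded = decode_text('Fish &amp; chips, caf&eacute;');
# $decoded now contains a literal "&" and a real e-acute character (U+00E9)
```

After decoding, a character like the e-acute is still in the string, so a separate filtering step is needed if only plain ASCII should remain.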
i was just on my way to edit the post, that's the first thing i used and it removed some but not all of the characters...
i was thinking more of loading any text file and filtering out anything that's not plain text, that type of thing...

perkiset
You're probably only interested in Perl, but in PHP you can also do:

    $newStr = htmlspecialchars($inputStr);

... which converts the HTML special characters to their entity encodings, i.e., "&" becomes "&amp;" etc. It only touches &, <, > and the quote characters though, and this is all assuming that you're taking an HTML doc and converting it. If you're taking real international input outside of the web world then you'd probably look at htmlentities for PHP, just like Dirk has said with Perl.

jubegnx
i will look into that...
thanks,

Bompa
Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like the underscore (or whatever). So, I only allow shit I can see on my keyboard. Well, I think that's how I do it.
Bompa

dirk
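A note on Bompa's allow-list idea above: it could be sketched in Perl roughly like this. The exact set of allowed characters (letters, digits, underscore, space) and the helper name keep_allowed are assumptions for illustration, not Bompa's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only letters, digits, underscore and space.
sub keep_allowed {
    my ($text) = @_;
    (my $clean = $text) =~ s/[^A-Za-z0-9_ ]//g;   # delete everything not listed
    return $clean;
}

print keep_allowed("Hello, w\xF6rld_42!"), "\n";   # prints "Hello wrld_42"
```

Listing what is allowed, rather than what is forbidden, means unexpected characters are dropped by default.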
Using a regex you could skip all weird characters and keep only ASCII 0 - 126:

    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg; # ASCII 0 - 126

nutballs
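A note on dirk's substitution above: the sketch below wraps it in a small sub and runs it on a sample string. The sub name strip_non_ascii and the sample bytes are made up for illustration. Note that the character class keeps control characters such as tab and newline too, since they fall inside \x00-\x7E.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping dirk's substitution.
sub strip_non_ascii {
    my ($string) = @_;
    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # delete anything outside 0x00-0x7E
    return $string;
}

my $dirty = "caf\xC3\xA9 \xE2 ok";
print strip_non_ascii($dirty), "\n";   # prints "caf  ok" (two spaces remain)
```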
Quote from Bompa:
> Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like the underscore, (or whatever).

i also do along the lines of what bomps does. this is an ASP function that does exactly that without using regex. I actually found this to be faster for really long text. i know this is the Perl board, but the concept is the same and doesn't use any functions that wouldn't be available in any language.

have a string of valid characters. check each letter in the dirty string against the valid string. replace the character if it's bad.

so for URLs i run it as stripnonalphanumerics(someURL,"-")
for content i run it as stripnonalphanumerics(someURL," ")

    function stripnonalphanumerics(dirtystring,replacewith)
        dim text,i,validstring,letter
        text=""
        validstring="1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        for i = 1 to len(dirtystring)
            letter=mid(dirtystring,i,1)
            ' instr returns 0 when the letter is not in validstring
            if instr(validstring,letter) > 0 then
                text=text&letter
            else
                text=text&replacewith
            end if
        next
        stripnonalphanumerics=text
    end function

perkiset
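Since this is the Perl board, here is a rough Perl port of nutballs' ASP function above. It is a sketch that follows the same idea (a string of valid characters plus a lookup, no regex); the sub name and the sample input are assumptions for illustration.

```perl
use strict;
use warnings;

# Perl port of the ASP function above; no regex, just index() lookups.
sub strip_non_alphanumerics {
    my ($dirty, $replace_with) = @_;
    my $valid = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    my $text  = '';
    for my $letter (split //, $dirty) {
        # index() returns -1 when $letter is not in $valid
        $text .= index($valid, $letter) >= 0 ? $letter : $replace_with;
    }
    return $text;
}

print strip_non_alphanumerics('my page/title?', '-'), "\n";   # prints "my-page-title-"
```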
Here's a PHP function to do the same:

    function alphaOnly($inStr)
    {
        $outArr = array();
        $max = strlen($inStr);
        for ($i = 0; $i < $max; $i++) {
            $char = ord($inStr[$i]);
            // keep printable ASCII only (codes 32 through 126)
            if (($char > 31) && ($char < 127))
                $outArr[] = $inStr[$i];
        }
        return implode('', $outArr);
    }

jubegnx
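A note on perkiset's PHP loop above: the same printable-ASCII filter (codes 32 through 126) could be sketched in Perl with tr instead of a character loop. This swaps in a different technique from the one in the post, and the sub name printable_only is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: keep printable ASCII (32-126) only.
sub printable_only {
    my ($text) = @_;
    # c = complement the listed range, d = delete the characters that match
    (my $clean = $text) =~ tr/\x20-\x7E//cd;
    return $clean;
}

print printable_only("tab\there\x07 caf\xE9"), "\n";   # prints "tabhere caf"
```

Unlike the \x00-\x7E regex, this version also drops tabs, newlines and other control characters.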
thanks for the help guys... this one did the trick:

    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;

i really suck with regex, i have to practice more!

Bompa
Quote from dirk:
> Using a regex you could skip all weird characters and keep only ASCII 0 - 126:
> $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg; # ASCII 0 - 126

What does the 'xmsg' do? Those aren't regex flags, are they? Rather confusing with all the curly braces. You're making me crosseyed.

dirk
I use the 'xms' based on recommendations of Perl Best Practices:

- Always use the /x flag (extended formatting).
- Always use the /m flag (matching line boundaries).
- Always use the /s flag (matching anything).

If you use the brace delimiters {} you don't have to escape the slashes, like http://.

perkiset
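A note on the flags dirk lists above: the short sketch below shows what each one changes. The sample text and variable names are made up for illustration. The trailing g in 'xmsg' is the global flag, which makes the match or substitution apply to every occurrence instead of only the first.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /m: ^ and $ match at every line boundary, not only at the string ends.
my @with_m    = $text =~ /^(\w+)/mg;   # ("first", "second")
my @without_m = $text =~ /^(\w+)/g;    # ("first")

# /s: the dot also matches a newline.
my $with_s    = $text =~ /first.*second/s ? 1 : 0;   # 1
my $without_s = $text =~ /first.*second/  ? 1 : 0;   # 0

# /x: whitespace and comments inside the pattern are ignored.
my $with_x = $text =~ / second \s+ line   # two words
                      /x ? 1 : 0;                    # 1
```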
Ah! Thanks Dirk, being a Perl st00bie, I didn't want to say anything at all... but now I get that those are the regex behavior modifiers. In PHP we also put the modifiers after the closing delimiter, i.e., /^(.*)$/ism

thanks for clearing that one up, the syntax really had me cross eyed as well

dirk
This is the usual syntax which looks more familiar:

    $string =~ s/[^\x00-\x7E]//sg; # ASCII 0 - 126

Bompa
Quote from dirk:
> Using a regex you could skip all weird characters and keep only ASCII 0 - 126:
> $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg; # ASCII 0 - 126

dirk, what's with the double curly braces preceding the xmsg;

    {}xmsg;

what do the curlies do?
Bompa

dirk
Bompa,
the empty curly braces {} mean that whatever the pattern in the preceding curly braces matches shall be replaced by nothing. So the special characters will be deleted.
Dirk

Bompa
Quote from dirk:
> Bompa, the empty curly braces {} mean that the string in the preceding curly braces shall be replaced by nothing. So the special characters will be deleted. Dirk

ahhh, I finally get it. Bompa <--- SLOW

You have an extra curly brace cuz YOU HAVE TO in order to have braces in pairs, whereas, if we delimit with slashes, we can use just three. damn!

dirk
Bompa,
here are some more examples:

    $string =~ s/[^\x00-\x7E]//;
    $string =~ s|[^\x00-\x7E]||;
    $string =~ s~[^\x00-\x7E]~~;
    $string =~ s([^\x00-\x7E])();
    $string =~ s[[^\x00-\x7E]][];
    $string =~ s{[^\x00-\x7E]}{};

Normally only three delimiters are required. But if you use brackets you need 4 (two pairs).
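A note on dirk's point about brace delimiters and slashes: the sketch below strips a leading http:// both ways. The URL and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $url = 'http://example.com/a/b';   # made-up sample URL

# With slash delimiters every literal slash needs a backslash.
(my $with_slashes = $url) =~ s/http:\/\///;

# With brace delimiters the slashes can stay as they are.
(my $with_braces = $url) =~ s{http://}{};

# Both variables now hold "example.com/a/b".
```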
