The Cache: Technology Expert's Forum
 
Author Topic: filtering bad characters  (Read 8079 times)
jubegnx
« on: June 26, 2007, 02:11:15 PM »

hey all,

i made this little article scraper that basically takes a number of articles and filters out only the body of the article.
now sometimes it returns some really weird characters within the article > –

is there a perl module or function that will filter out only plain text?

thanks

dirk
« Reply #1 on: June 26, 2007, 02:36:22 PM »

You could have a look at:

HTML::Entities - Encode or decode strings with HTML entities

We used it to get rid of such weird characters.

jubegnx
« Reply #2 on: June 26, 2007, 02:50:57 PM »

i was just on my way to edit the post, that's the first thing i used and it removed some but not all of the characters...

i was thinking more of loading any text file and filtering out anything that's not plain text, that type of thing...

perkiset
« Reply #3 on: June 26, 2007, 03:11:01 PM »

You're probably only interested in PERL, but in PHP, you can also do:

$newStr = htmlspecialchars($inputStr);

... which converts the HTML special characters to their entity encodings, i.e. "&" becomes "&amp;" etc. It only touches &, <, > and quotes though, and this is all assuming that you're taking an HTML doc and converting it. If you're taking real international input outside of the web world then you'd probably look at htmlentities for PHP, which handles all the international stuff as well, just like Dirk has said with PERL.
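
For comparison outside PHP and PERL, here is a minimal Python sketch of the same encode/decode round trip. It assumes Python's standard html module, whose escape() is only roughly comparable to htmlspecialchars and whose unescape() is only roughly comparable to decoding with HTML::Entities; it is not a drop-in for either, and the sample strings are made up.

```python
import html

raw = 'Tom & Jerry <b>"quoted"</b>'

# Encode the HTML special characters (&, <, > and quotes) as entities.
encoded = html.escape(raw)

# Decode named and numeric entities back into real characters.
decoded = html.unescape("caf&eacute; &ndash; &amp; more")

print(encoded)
print(decoded)
```

Note that decoding leaves you with real non-ASCII characters (an accented e, an en dash), so you would still need one of the filters discussed in this thread if you want plain ASCII afterwards.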

/p

jubegnx
« Reply #4 on: June 26, 2007, 03:44:43 PM »

i will look into that...

thanks,

Bompa
« Reply #5 on: June 27, 2007, 02:36:48 AM »

Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like the underscore (or whatever).  So, I only allow shit I can see on my keyboard.

Well, I think that's how I do it.

Bompa

dirk
« Reply #6 on: June 27, 2007, 06:18:31 AM »

Using a regex you could skip all weird characters and keep only ASCII 0 - 126:
Code:
$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # ASCII 0 - 126 (\x7F, DEL, is dropped too)
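
For anyone reading along outside Perl, roughly the same idea as a minimal Python sketch. The helper name strip_non_ascii is made up for illustration, and the character class is copied from the one-liner above.

```python
import re

# Same character class as the Perl one-liner: anything outside 0x00-0x7E.
NON_ASCII = re.compile(r"[^\x00-\x7E]")

def strip_non_ascii(text: str) -> str:
    """Hypothetical helper: delete every character outside 0x00-0x7E."""
    return NON_ASCII.sub("", text)

print(strip_non_ascii("caf\u00e9 \u2013 ok"))
```

This deletes the offending characters outright, so an accented word simply loses the accented letter instead of getting a plain replacement.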

nutballs
« Reply #7 on: June 27, 2007, 09:24:40 AM »

Quote from: Bompa
Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like the underscore, (or whatever).

i also do along the lines of what bomps does.

this is an ASP function that does exactly that without using regex. I actually found this to be faster for really long text. i know this is the PERL board, but the concept is the same and doesn't use any functions that wouldn't be available in any language.
have a string of valid characters.
check each letter in the dirty string against the valid string.
replace the character if it's bad.

so for URLs i run it as   stripnonalphanumerics(someURL,"-")
for content i run it as   stripnonalphanumerics(someURL," ")

Code:
function stripnonalphanumerics(dirtystring,replacewith)
    dim text,i,letter,validstring
    text=""
    ' whitelist of characters to keep
    validstring="1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    for i = 1 to len(dirtystring)
        letter=mid(dirtystring,i,1)
        ' instr returns 0 when the letter is not in the whitelist
        if instr(validstring,letter) > 0 then
            text=text&letter
        else
            text=text&replacewith
        end if
    next
    stripnonalphanumerics=text
end function
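
The three steps above carry over to pretty much any language. As a rough Python sketch (the names VALID_CHARS and strip_non_alphanumerics are made up here, and the whitelist is assumed to be the same digits-plus-letters set as validstring):

```python
import string

# Assumed whitelist: digits plus ASCII letters, the same set as validstring.
VALID_CHARS = set(string.digits + string.ascii_letters)

def strip_non_alphanumerics(dirty: str, replace_with: str) -> str:
    """Keep whitelisted characters, swap every other one for replace_with."""
    return "".join(ch if ch in VALID_CHARS else replace_with for ch in dirty)

print(strip_non_alphanumerics("my page?id=1", "-"))
```

Using a set makes each membership check constant time, which is one way to keep the no-regex approach quick on really long text.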

perkiset
« Reply #8 on: June 27, 2007, 09:42:45 AM »

Here's a PHP function to do the same:

Code:
function alphaOnly($inStr)
{
    $outArr = array();
    $max = strlen($inStr);
    for ($i=0; $i<$max; $i++)
    {
        // keep printable ASCII only (codes 32 through 126)
        $code = ord($inStr[$i]);
        if (($code > 31) && ($code < 127))
            $outArr[] = $inStr[$i];   // store the character, not its code
    }
    return implode('', $outArr);
}
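
The same printable-ASCII filter (codes 32 through 126) as a short Python sketch, for comparison. The function name printable_ascii_only is made up for illustration.

```python
def printable_ascii_only(text: str) -> str:
    """Hypothetical helper: keep characters with code points 32..126 only."""
    return "".join(ch for ch in text if 31 < ord(ch) < 127)

print(printable_ascii_only("tab\there \u2013 done"))
```

Unlike a plain "ASCII only" filter, this one also drops control characters such as tabs and newlines, which may or may not be what you want for article bodies.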

/p

jubegnx
« Reply #9 on: June 27, 2007, 12:16:14 PM »

thanks for the help guys... this one did the trick    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;

i really suck with regex, i have to practice more!

Bompa
« Reply #10 on: June 27, 2007, 06:04:02 PM »

Quote from: dirk
Using a regex you could skip all weird characters and keep only ASCII 0 - 126:
Code:
$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # ASCII 0 - 126


What does the 'xmsg' do?  Those aren't regex flags are they?

Rather confusing with all the curly braces.  You're making me crosseyed.



dirk
« Reply #11 on: June 27, 2007, 08:21:10 PM »

I use the 'xms' based on the recommendations of Perl Best Practices:

Always use the /x flag (extended formatting: whitespace and comments in the pattern are ignored).
Always use the /m flag (^ and $ match at line boundaries).
Always use the /s flag (the dot matches anything, including newlines).

If you use the brace delimiters {} you don't have to escape the slashes, like http:\/\/.
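
For readers who don't write Perl, here is what those three flags do, as a small Python sketch. It assumes Python's re.VERBOSE, re.MULTILINE and re.DOTALL behave like Perl's /x, /m and /s, which is close but not identical in every corner case; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# re.VERBOSE (Perl /x): whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    [^\x00-\x7E]   # anything outside 0x00-0x7E
""", re.VERBOSE)

# re.MULTILINE (Perl /m): ^ and $ also match at each line boundary.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# re.DOTALL (Perl /s): the dot matches newlines too.
whole = re.match(r".*", text, re.DOTALL).group(0)

# Without DOTALL the dot stops at the first newline.
first_only = re.match(r".*", text).group(0)

print(line_starts, repr(whole), repr(first_only))
```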



perkiset
« Reply #12 on: June 27, 2007, 10:07:54 PM »

Ah!

Thanks Dirk, being a PERL st00bie, I didn't want to say anything at all... but now I get that those are the regex behavior modifiers. In PHP we also put the modifiers after the closing delimiter, i.e.,

/^(.*)$/ism

thanks for clearing that one up, the syntax really had me cross eyed as well

dirk
« Reply #13 on: June 28, 2007, 10:49:17 AM »

This is the usual syntax which looks more familiar:

Code:
$string =~ s/[^\x00-\x7E]//sg;   # ASCII 0 - 126

Bompa
« Reply #14 on: July 10, 2007, 04:53:52 AM »

Quote from: dirk
Using a regex you could skip all weird characters and keep only ASCII 0 - 126:
Code:
$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # ASCII 0 - 126


dirk, what's with the double curly braces preceding the xmsg;

{}xmsg;


what do the curlies do?


Bompa