filtering bad characters


jubegnx

hey all,

i made this little article scraper that basically takes a number of articles and filters out only the body of the article.
now sometimes it returns some really weird characters within the article, e.g. ā€“

is there a Perl module or function that will filter out only plain text?

thanks

dirk

You could have a look at:

HTML::Entities - Encode or decode strings with HTML entities

We used it to get rid of such weird characters.

jubegnx

i was just on my way to edit the post, that's the first thing i used and it removed some but not all of the characters...

i was thinking more of loading any text file and filtering out anything that's not plain text, that type of thing...

perkiset

You're probably only interested in Perl, but in PHP you can also do:

$newStr = htmlspecialchars($inputStr);

... which converts the HTML special characters to their entity encodings, i.e., "&" becomes "&amp;" etc. It only touches &, <, > and quotes, though, and this is all assuming that you're taking an HTML doc and converting it. If you're taking real international input outside of the web world then you'd probably look at htmlentities for PHP, just like Dirk has said with Perl.

/p
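
For anyone who wants to see the difference between escaping markup characters and encoding non-ASCII characters, here is a minimal sketch. It is in Python rather than Perl or PHP, purely for illustration, and the sample string is made up; the calls are from Python's standard library, not from HTML::Entities or PHP.

```python
import html

text = "Fish & Chips <b>caf\u00e9</b>"

# html.escape only rewrites &, < and > (plus quotes by default), roughly
# what PHP's htmlspecialchars does; the accented e passes through untouched.
escaped = html.escape(text)
# escaped is: Fish &amp; Chips &lt;b&gt;caf\u00e9&lt;/b&gt;

# Encoding to ASCII with "xmlcharrefreplace" leaves the markup alone and
# turns each non-ASCII character into a numeric character reference.
encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
# encoded is: Fish & Chips <b>caf&#233;</b>

# html.unescape goes the other way, from entities back to characters.
decoded = html.unescape("caf&eacute; &ndash; ok")
# decoded is: caf\u00e9 \u2013 ok
```

Note that none of these throw characters away. They only change how the characters are written, which may be why entity encoding alone removed some but not all of the odd characters.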

jubegnx

i will look into that...

thanks,

Bompa

Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like
the underscore (or whatever). So, I only allow shit I can see on my keyboard.

Well, I think that's how I do it.

Bompa

dirk

Using a regex you could skip all weird characters and keep only ASCII 0 - 127:

$string =~ s{ ( [^\x00-\x7F] ) }{}xmsg; # ASCII 0 - 127
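
The same idea carries over to other languages. A minimal sketch in Python, for illustration only: the names NON_ASCII and strip_non_ascii are made up, and the character class assumes the goal is to keep the full 7-bit ASCII range 0 - 127.

```python
import re

# Matches any single character outside the 7-bit ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def strip_non_ascii(text: str) -> str:
    """Delete every character whose code point is above 127."""
    return NON_ASCII.sub("", text)

# The accented e and the en dash are removed; the spaces around the dash stay.
print(strip_non_ascii("caf\u00e9 \u2013 ok"))  # prints "caf  ok" (two spaces)
```

Keep in mind this deletes characters outright, so accented letters vanish rather than being converted to a plain equivalent.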

nutballs

Quote from Bompa:

Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like
the underscore (or whatever).

i also do along the lines of what Bompa does.

this is an ASP function that does exactly that without using regex. I actually found this to be faster for really long text. i know this is the Perl board, but the concept is the same and doesn't use any functions that wouldn't be available in any language.
have a string of valid characters.
check each letter in the dirty string against the valid string.
replace the character if it's bad.

so for URLs i run it as stripnonalphanumerics(someURL,"-")

for content i run it as stripnonalphanumerics(someURL," ")

function stripnonalphanumerics(dirtystring,replacewith)
    dim text,i,letter,validstring
    text=""
    ' whitelist of characters that are allowed through
    validstring="1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    for i = 1 to len(dirtystring)
        letter=mid(dirtystring,i,1)
        ' instr returns 0 when the letter is not in the whitelist
        if instr(validstring,letter) > 0 then
            text=text&letter
        else
            text=text&replacewith
        end if
    next
    stripnonalphanumerics=text
end function
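
The whitelist approach in the three steps above can be sketched in Python as well. This is only an illustration: the names VALID_CHARS and strip_non_alphanumerics are made up, and the whitelist is assumed to be the same digits and letters as in the ASP version.

```python
import string

# Same whitelist as the ASP version: digits plus lower- and uppercase letters.
VALID_CHARS = set(string.digits + string.ascii_letters)

def strip_non_alphanumerics(dirty: str, replace_with: str) -> str:
    """Replace every character that is not in VALID_CHARS with replace_with."""
    return "".join(ch if ch in VALID_CHARS else replace_with for ch in dirty)

# Each space and the colon become a dash, so ": " turns into two dashes.
print(strip_non_alphanumerics("my article: part 2", "-"))  # my-article--part-2
```

Using a set for the whitelist keeps each lookup cheap, which matters on really long text. Passing an empty string as replace_with deletes the bad characters instead of replacing them.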

perkiset

Here's a PHP function to do the same:

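function alphaOnly($inStr)
{
    $outArr = array();
    $max = strlen($inStr);
    for ($i = 0; $i < $max; $i++)
    {
        // ord() gives the byte value; keep only printable ASCII (32 to 126)
        $char = ord($inStr[$i]);
        if (($char > 31) && ($char < 127))
            $outArr[] = $inStr[$i]; // store the character itself, not its code
    }
    return implode('', $outArr);
}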

jubegnx

thanks for the help guys... this one did the trick: $string =~ s{ ( [^\x00-\x7F] ) }{}xmsg;

i really suck with regex, i have to practice more!

Bompa

Quote from dirk:

Using a regex you could skip all weird characters and keep only ASCII 0 - 127:

$string =~ s{ ( [^\x00-\x7F] ) }{}xmsg; # ASCII 0 - 127

What does the 'xmsg' do? Those aren't regex flags, are they?

Rather confusing with all the curly braces. You're making me crosseyed.

dirk

I use the 'xms' based on the recommendations of Perl Best Practices:

Always use the /x flag (extended formatting).
Always use the /m flag (matching line boundaries).
Always use the /s flag (matching anything).

If you use the brace delimiters {} you don't have to escape the slashes, like http://.
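
To see what each of those flags changes, here is a small sketch using the closest Python counterparts (re.VERBOSE for /x, re.MULTILINE for /m, re.DOTALL for /s). It is Python rather than Perl, for illustration only, and the sample strings are made up. Perl's g flag (replace every match, not just the first) has no counterpart here because Python's re.sub already replaces every match by default.

```python
import re

# re.VERBOSE ~ /x: whitespace and comments inside the pattern are ignored.
pattern = re.compile(r"""
    [^\x00-\x7F]    # any character outside ASCII 0 - 127
""", re.VERBOSE)
assert pattern.sub("", "na\u00efve") == "nave"

# re.MULTILINE ~ /m: ^ and $ also match at the start and end of each line.
assert re.findall(r"^\w+$", "one\ntwo", re.MULTILINE) == ["one", "two"]
assert re.findall(r"^\w+$", "one\ntwo") == []

# re.DOTALL ~ /s: the dot matches newline characters too.
assert re.match(r"a.b", "a\nb", re.DOTALL) is not None
assert re.match(r"a.b", "a\nb") is None
```

For the character-class substitution in this thread, /m and /s make no difference, since the pattern contains no ^, $ or dot; they are there out of habit.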

perkiset

Ah!

Thanks Dirk, being a Perl st00bie, I didn't want to say anything at all... but now I get that those are the regex behavior modifiers. In PHP we also put the modifiers after the closing delimiter, i.e.,

/^(.*)$/ism

thanks for clearing that one up, the syntax really had me cross eyed as well

dirk

This is the usual syntax which looks more familiar:

$string =~ s/[^\x00-\x7F]//sg; # ASCII 0 - 127

Bompa

Quote from dirk:

Using a regex you could skip all weird characters and keep only ASCII 0 - 127:

$string =~ s{ ( [^\x00-\x7F] ) }{}xmsg; # ASCII 0 - 127

dirk, what's with the double curly braces preceding the xmsg;

{}xmsg;

what do the curlies do?

Bompa

dirk

Bompa,

the empty curly braces {} mean that the string in the preceding curly braces
shall be replaced by nothing. So the special characters will be deleted.

Dirk

Bompa

Quote from dirk:

Bompa,

the empty curly braces {} mean that the string in the preceding curly braces
shall be replaced by nothing. So the special characters will be deleted.

Dirk

ahhh, I finally get it.

Bompa <--- SLOW

You have an extra curly brace cuz YOU HAVE TO in order to have braces in pairs,
whereas, if we delimit with slashes, we can use just three.

damn!

dirk

Bompa,

here are some more examples:

$string =~ s/[^\x00-\x7F]//;
$string =~ s|[^\x00-\x7F]||;
$string =~ s~[^\x00-\x7F]~~;

$string =~ s([^\x00-\x7F])();
$string =~ s[[^\x00-\x7F]][];
$string =~ s{[^\x00-\x7F]}{};

Normally only three delimiters are required.

But if you use brackets you need 4 (two pairs).
