jubegnx

hey all,

i made this little article scraper that takes a number of articles and extracts only the body of each article.
now sometimes it returns some really weird characters within the article, like this: –

is there a perl module or function that will filter out only plain text?

thanks

dirk

You could have a look at:

HTML::Entities - Encode or decode strings with HTML entities

We used it to get rid of such weird characters.

jubegnx

i was just on my way to edit the post, that's the first thing i used and it removed some but not all of the characters...

i was thinking more of loading any text file and filtering out anything that's not plain text, that type of thing...

perkiset

You're probably only interested in Perl, but in PHP you can also do:

$newStr = htmlspecialchars($inputStr);

... which converts the HTML special characters to their entity encodings, i.e. "&" becomes "&amp;" etc. That assumes you're taking an HTML doc and converting it. If you're taking real international input outside of the web world then you'd probably look at htmlentities for PHP, which encodes the international characters as well, just like Dirk has said with Perl.

/p

jubegnx

i will look into that...

thanks,

Bompa

Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like
the underscore (or whatever).  So, I only allow shit I can see on my keyboard.

Well, I think that's how I do it.

Bompa
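
In Perl, that allow-list idea might look something like the sketch below. The helper name keep_allowed and the exact set of allowed characters are assumptions for illustration, not anything Bompa posted.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only characters on an explicit allow-list.
# The list (letters, digits, underscore, space, dot, comma, newline, hyphen)
# is an assumption; adjust it to whatever you consider plain text.
sub keep_allowed {
    my ($text) = @_;
    # /c complements the list and /d deletes, so every character
    # that is NOT listed gets removed.
    (my $clean = $text) =~ tr/a-zA-Z0-9_ .,\n-//cd;
    return $clean;
}

# The e-acute, the en dash and the percent sign are not on the list,
# so they are dropped from the output.
print keep_allowed("caf\x{e9} \x{2013} fine_ok 100%\n");
```

tr is used here instead of a regex because it only has to look up single characters, which keeps the sketch short.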

dirk

Using a regex you could strip all weird characters and keep only ASCII 0 - 126 (0x00 - 0x7E):

$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;  # keep ASCII 0x00 - 0x7E
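
For anyone who wants to try that substitution on a sample string, here is a minimal sketch. The wrapper name strip_non_ascii and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution; the name is an assumption.
sub strip_non_ascii {
    my ($string) = @_;
    # Delete every character outside the range 0x00-0x7E.
    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;
    return $string;
}

# Sample input is an assumption: ASCII text with an en dash (U+2013) in it.
print strip_non_ascii("pages 10\x{2013}20"), "\n";   # prints "pages 1020"
```

This works the same way if the scraped text is still raw UTF-8 bytes, because every byte of a multi-byte character is above 0x7E and gets deleted too.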

nutballs

Quote from Bompa:

Instead of eliminating unwanted shit, I only allow alphanumerics plus a few others like
the underscore (or whatever).

i also do something along the lines of what Bompa does.

this is an ASP function that does exactly that without using regex. I actually found this to be faster for really long text. i know this is the Perl board, but the concept is the same and doesn't use any functions that wouldn't be available in any language.
have a string of valid characters.
check each letter in the dirty string against the valid string.
replace the character if it's bad.

so for URLs i run it as  stripnonalphanumerics(someURL,"-")
for content i run it as  stripnonalphanumerics(someContent," ")


' Replaces every character that is not in validstring with replacewith.
function stripnonalphanumerics(dirtystring,replacewith)
    dim text,i,letter,validstring
    text=""
    validstring="1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    for i = 1 to len(dirtystring)
        letter=mid(dirtystring,i,1)
        ' instr returns 0 when the letter is not in validstring
        if instr(validstring,letter) > 0 then
            text=text&letter
        else
            text=text&replacewith
        end if
    next
    stripnonalphanumerics=text
end function
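
Since this is the Perl board, here is a rough Perl equivalent of the same interface. Note that it swaps the character-by-character loop for a regex substitution, so it is not the no-regex technique described above, and the sub name is just a Perl-style rendering of the ASP one.

```perl
use strict;
use warnings;

# Hypothetical Perl port of the ASP function; it uses a regex
# instead of the lookup loop, which is a different technique.
sub strip_non_alphanumerics {
    my ($dirty, $replace_with) = @_;
    # Replace each character that is not an ASCII letter or digit.
    (my $clean = $dirty) =~ s/[^0-9a-zA-Z]/$replace_with/g;
    return $clean;
}

print strip_non_alphanumerics("my page/title!", "-"), "\n";   # prints "my-page-title-"
```

Whether this is faster or slower than a loop on really long text is something you would have to measure yourself.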

perkiset

Here's a PHP function to do the same:


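// Keeps only printable ASCII characters (codes 32 through 126).
function alphaOnly($inStr)
{
    $outArr = array();
    $max = strlen($inStr);
    for ($i=0; $i<$max; $i++)
    {
        $char = ord($inStr[$i]);
        if (($char > 31) && ($char < 127))
            $outArr[] = $inStr[$i];   // append the character, not its code
    }
    return implode('', $outArr);
}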


/p

jubegnx

thanks for the help guys... this one did the trick    $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;

i really suck with regex, i have to practice more!

Bompa

Quote from dirk:

Using a regex you could strip all weird characters and keep only ASCII 0 - 126 (0x00 - 0x7E):

$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # keep ASCII 0x00 - 0x7E

What does the 'xmsg' do?  Those aren't regex flags, are they?

Rather confusing with all the curly braces.  You're making me crosseyed.



dirk

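I use 'xms' based on the recommendations of Perl Best Practices:

Always use the /x flag (extended formatting: whitespace and comments in the pattern are ignored).
Always use the /m flag (^ and $ match at line boundaries).
Always use the /s flag (. matches any character, including newline).

The 'g' makes the substitution global, so every match is replaced and not just the first one.

If you use the brace delimiters {} you don't have to escape slashes in the pattern, as in http://.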
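
To see what each of those flags changes, here is a small sketch. The sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /m: ^ matches at the start of every line, not only at the start of the string.
# /g in list context returns the capture from every match.
my @starts = $text =~ m{ ^ (\w+) }xmsg;            # ("first", "second")

# /s: . also matches a newline, so .* can run across the line break.
my ($span) = $text =~ m{ first (.*) second }xms;   # " line\n"

# /x: the spaces inside the pattern are ignored.
# /g: every vowel is removed, not only the first one.
(my $no_vowels = $text) =~ s{ [aeiou] }{}xmsg;     # "frst ln\nscnd ln\n"

print join(",", @starts), "\n";
print $no_vowels;
```

Without /m the first match would only find "first", and without /s the second match would fail because .* stops at the newline.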



perkiset

Ah!

Thanks Dirk, being a Perl st00bie, I didn't want to say anything at all... but now I get that those are the regex behavior modifiers. In PHP we also put the modifiers after the closing delimiter, i.e.,

/^(.*)$/ism

thanks for clearing that one up, the syntax really had me cross eyed as well

dirk

This is the usual syntax which looks more familiar:


$string =~ s/[^\x00-\x7E]//sg;  # keep ASCII 0x00 - 0x7E

Bompa

Quote from dirk:

Using a regex you could strip all weird characters and keep only ASCII 0 - 126 (0x00 - 0x7E):

$string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;   # keep ASCII 0x00 - 0x7E

dirk, what's with the double curly braces preceding the xmsg;

{}xmsg;

what do the curlies do?


Bompa

dirk

Bompa,

the empty curly braces {} mean that whatever the pattern in the preceding curly braces
matches is replaced by nothing. So the special characters will be deleted.

Dirk

Bompa

Quote from dirk:

the empty curly braces {} mean that whatever the pattern in the preceding curly braces
matches is replaced by nothing. So the special characters will be deleted.

ahhh, I finally get it.

Bompa <--- SLOW

You have an extra curly brace cuz YOU HAVE TO in order to have braces in pairs,
whereas, if we delimit with slashes, we can use just three.

damn!

dirk

Bompa,

here are some more examples:


$string =~ s/[^\x00-\x7E]//;
$string =~ s|[^\x00-\x7E]||;
$string =~ s~[^\x00-\x7E]~~;

$string =~ s([^\x00-\x7E])();
$string =~ s[[^\x00-\x7E]][];
$string =~ s{[^\x00-\x7E]}{};


Normally only three delimiters are required.

But if you use brackets you need 4 (two pairs).
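
The choice of delimiter matters most when the pattern itself contains slashes. A minimal sketch, with an example URL and variable names that are assumptions for illustration:

```perl
use strict;
use warnings;

my $url = "http://example.com/page";   # example URL is an assumption

# With slash delimiters every literal slash in the pattern must be escaped.
(my $with_slashes = $url) =~ s/^http:\/\///;

# With brace delimiters the slashes can be written as they are.
(my $with_braces = $url) =~ s{^http://}{};

print "$with_slashes\n";   # prints "example.com/page"
print "$with_braces\n";    # prints "example.com/page"
```

Both substitutions do exactly the same thing; the brace form is just easier to read.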

