Weird characters in scrape targets

nutballs

So in PHP, how do I deal with non-ASCII characters that are not written as HTML entity references?

Examples: У Ф Т
When you rip them, they come out as тАЩ, for example.

I am using perk's WebRequest class to grab the stuff, and then my own giant list of regexes to strip out all the garbage. However, when I display the final cleaned-up strings, those characters are funky.

Thoughts?

nop_90

You can either use the regexp on the board to eliminate all non-ASCII chars,
or you can mess around with UTF-8 encodings.
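
To make the first option concrete, here is a minimal sketch of stripping non-ASCII bytes with a regex. The board's actual regexp is not shown in this thread, so the pattern below is an assumption, as are the sample string and the variable names. The second half is a swapped-in alternative (not something proposed above): it keeps the text as UTF-8 and maps a few common "smart" punctuation marks to plain ASCII instead of deleting them.

```php
<?php
// Hypothetical sample: a scraped string containing a UTF-8 right single quote (U+2019).
$scraped = "It\xE2\x80\x99s a test";

// Option 1: drop every byte outside the 7-bit ASCII range.
// Without the /u modifier the pattern works on raw bytes.
$asciiOnly = preg_replace('/[^\x00-\x7F]/', '', $scraped);

// Option 2 (assumption, not from the thread): map typographic quotes to ASCII.
$map = [
    "\u{2018}" => "'", "\u{2019}" => "'",
    "\u{201C}" => '"', "\u{201D}" => '"',
];
$normalized = strtr($scraped, $map);

echo $asciiOnly, "\n";   // Its a test
echo $normalized, "\n";  // It's a test
```

Stripping is simple but lossy: the apostrophe disappears entirely, which is why mapping or converting is often the nicer choice for display text.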

perkiset

I am curious about the encodings as well. I just looked at the database and the interesting characters are even stored as У Ф Т and viewable/editable in phpMyAdmin, so clearly there's a reasonably easy way to deal with them... I have just never done it.

If I get a moment I'll see if I can ferret out where that is happening in the SMF codebase...

nutballs

Yeah, it's a character encoding issue. It must be PHP that is doing it during a string conversion, or the regex is doing it. I will see if I can figure out which step in the process causes it.

nutballs

It's the web request class that's doing it, apparently. Must be the encoding. Any thoughts?

This results in what I am talking about:

$url='http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req = new WebRequest2();
echo $req->simpleGet($url);

perkiset

I think it's actually fine, even though it looks weird in whatever you're looking at it with. The funny characters appear correctly in a browser, even if they appear really weird in the console. They store correctly on the disk and in a DB.

I just tested by doing this:

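<?php
// Assumes the WebRequest2 class from the post above is already loaded.
$req = new WebRequest2();
$req->domain = 'www.perkiset.org';
$req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req->dispatch();
// Send the page to the browser and keep a copy on disk for comparison.
echo $req->getContent();
file_put_contents('/www/sites/utility/lastget.html', $req->getContent());
?>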

When I call this from a browser, the browser displays the characters perfectly. Then if I request lastget.html, the characters are also perfect. I think this is an issue of looking at the UTF-8 from something other than a browser. Even the code you posted above works perfectly for me when called from a browser... (only tested in FF and Safari, but that's probably good enough...)
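
One way to check this from the console rather than a browser is to test whether the fetched bytes are valid UTF-8. The following is only a sketch: the helper name isValidUtf8 and the sample bytes are made up for the example, and CP866 is just one single-byte code page picked to show how the same three bytes can render as three unrelated characters. Whether the console in question actually used that code page is an assumption.

```php
<?php
// Hypothetical sample: the three UTF-8 bytes of U+2019 (right single quote).
$bytes = "\xE2\x80\x99";

// preg_match() with the /u modifier returns false on invalid UTF-8,
// so an empty pattern doubles as a cheap validity check.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8($bytes));      // bool(true)
var_dump(isValidUtf8("\xE2\x80"));  // bool(false), truncated sequence

// Decoding the same bytes as a single-byte code page gives three characters.
// CP866 is an illustrative choice only.
echo iconv('CP866', 'UTF-8', $bytes), "\n";  // тАЩ
```

If the content passes the validity check, the data itself is probably fine and the odd characters come from whatever is displaying it.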

nop_90

Not sure what the correct term is.

For the browser, it is set by this tag:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
You can change the charset to different encodings.

So on this board it is set to UTF-8 right now,
but if you scrape it and use a different encoding, it will be messed up.

Perl has different codecs you use to translate between the encodings:
http://perldoc.perl.org/utf8.html
So does Python:
http://evanjones.ca/python-utf8.html

Where the problem happens is that a char may not be valid for the encoding you have picked, and then the codec throws an error.
This happens when you scrape, say, a Russian site (the charset will be set to whatever Russian uses) while the text of the site is actually in English.
So now you have English with a Russian charset.
Real pain in the ass.
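
In PHP, the same kind of translation can be sketched with iconv(). The helpers below (detectCharset, toUtf8) are hypothetical names, not part of the WebRequest class, and the meta-tag regex is a rough assumption: real pages may declare the charset in the HTTP headers instead, or declare one encoding and use another, which is exactly the failure case described above.

```php
<?php
// Look for a charset declaration in the markup; fall back to a default
// when none is found.
function detectCharset(string $html, string $default = 'UTF-8'): string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Convert the page to UTF-8 so later regex cleanup sees a single encoding.
function toUtf8(string $html): string
{
    $charset = detectCharset($html);
    if ($charset === 'UTF-8') {
        return $html;
    }
    // iconv() returns false when a byte is not valid in the source charset;
    // in that case this sketch simply hands back the original bytes.
    $converted = iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// Hypothetical sample: a windows-1251 page whose body is the word "Привет".
$page = '<meta http-equiv="content-type" content="text/html; charset=windows-1251">'
      . "\xCF\xF0\xE8\xE2\xE5\xF2";
echo toUtf8($page), "\n";
```

Converting everything to UTF-8 once, right after the fetch, keeps the rest of the cleanup code from having to care which charset the source site used.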

nutballs

AHHHHHH, duh!

On my test page I have no complete HTML page structure; it's just straight output.
So as a result: no DOCTYPE, no head, no body, no character set, nothing. I'm guessing you do...

That's why it's displaying fished up.
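
A minimal sketch of that fix, assuming the scraped content is already UTF-8: declare the charset in the HTTP header and wrap the raw output in a small page that repeats it in a meta tag. The function name wrapFragment and the sample string are made up for illustration.

```php
<?php
// Declare the encoding before any output so the browser decodes the bytes as UTF-8.
// The guard keeps the sketch from erroring if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: wrap a scraped fragment in a minimal HTML page.
function wrapFragment(string $fragment): string
{
    return "<!DOCTYPE html>\n<html>\n<head>\n"
         . "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\">\n"
         . "</head>\n<body>\n" . $fragment . "\n</body>\n</html>\n";
}

// Hypothetical sample containing a UTF-8 right single quote.
echo wrapFragment("It\u{2019}s a test");
```

The header alone is usually enough for a browser; the meta tag also covers the case where the output is saved to disk and opened later as a file.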

