The Cache: Technology Expert's Forum
 
Author Topic: Weird characters in scrape targets  (Read 3512 times)
nutballs
« on: November 12, 2007, 06:35:29 PM »

So in PHP, how do I deal with non-ASCII characters that are not written as HTML entity references?

For example, when you rip them, a character like ’ comes out as garbage.

I am using perk's WebRequest class to grab the stuff, then my own giant list of regexes to strip out all the garbage. However, when I display the final cleaned-up strings, those characters are funky.

thoughts?

I could eat a bowl of Alphabet Soup and shit a better argument than that.
nop_90
« Reply #1 on: November 12, 2007, 06:42:16 PM »

You can either use the regexp on the board to eliminate all non-ASCII chars,
or you can mess around with UTF-8 encodings.
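
A rough sketch of both options, in case it helps. The function names are made up for illustration, the second one assumes the mbstring extension is installed, and it only works if you already know which charset the scraped bytes are really in.

```php
<?php
// Option 1: drop every byte outside the 7-bit ASCII range.
// Lossy: curly quotes, accents and so on simply disappear.
function strip_non_ascii(string $text): string
{
    // No /u modifier on purpose, so the pattern works byte by byte.
    return preg_replace('/[^\x00-\x7F]/', '', $text) ?? $text;
}

// Option 2: keep the characters and re-encode them as UTF-8.
// $sourceCharset is an assumption the caller has to supply.
function to_utf8(string $text, string $sourceCharset): string
{
    return mb_convert_encoding($text, 'UTF-8', $sourceCharset);
}

$scraped = "It\xE2\x80\x99s here";          // "It’s here" as UTF-8 bytes
echo strip_non_ascii($scraped), "\n";       // Its here

$latin1 = "caf\xE9";                        // "café" as ISO-8859-1 bytes
echo to_utf8($latin1, 'ISO-8859-1'), "\n";  // café (now UTF-8)
```

Option 1 is the quick fix if you only care about plain English text; option 2 keeps the characters but only works when the source charset is right.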
perkiset
« Reply #2 on: November 12, 2007, 06:51:00 PM »

I am curious about the encodings as well. I just looked at the database and the interesting characters are even stored and are viewable / editable in phpMyAdmin, so clearly there's a reasonably easy way to deal with them... I've just never done it.

If I get a moment I'll see if I can ferret out where that is happening in the SMF codebase...

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
nutballs
« Reply #3 on: November 12, 2007, 07:44:18 PM »

Yeah, it's a character encoding issue. Either PHP is doing it during a string conversion or the regex is. I will see if I can figure out which step in the process causes it.
nutballs
« Reply #4 on: November 12, 2007, 08:30:38 PM »

It's the web request class that's doing it, apparently. Must be the encoding. Any thoughts?

This reproduces what I am talking about:
Code:
$url='http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req = new WebRequest2();
echo $req->simpleGet($url);
perkiset
« Reply #5 on: November 12, 2007, 10:02:34 PM »

I think it's actually fine, even though it looks weird in whatever you're looking at it with. The funny characters appear correctly in a browser, even if they appear really weird in the console. They store correctly on the disk and in a DB.

I just tested by doing this:
Code:
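<?php

// Assumption: the WebRequest2 class is already loaded; the instantiation
// line is not in the post as extracted and is added here so the snippet runs.
$req = new WebRequest2();
$req->domain = 'www.perkiset.org';
$req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
$req->dispatch();
echo $req->getContent();
// Save the same response to disk so it can be opened directly in a browser.
file_put_contents('/www/sites/utility/lastget.html', $req->getContent());

?>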

When I call this from a browser, the browser displays the characters perfectly. Then if I request lastget.html the characters are also perfect. I think this is an issue of looking at the UTF-8 from something other than a browser. Even the code you posted above works perfectly for me when called from a browser... (only tested in FF and Safari, but that's probably good enough...)
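
To illustrate why the same bytes can look fine in one place and weird in another, here is a small sketch. It assumes the mbstring extension, and it uses Windows-1252 only as an example of a wrong guess; the helper name is hypothetical.

```php
<?php
// The right single quote (U+2019) is three bytes in UTF-8.
$utf8Quote = "\xE2\x80\x99";

// Simulate a viewer that wrongly treats those bytes as Windows-1252:
// each of the three bytes is decoded as its own character.
function misread_as_cp1252(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'Windows-1252');
}

echo bin2hex($utf8Quote), "\n";            // e28099
echo misread_as_cp1252($utf8Quote), "\n";  // â€™
```

The bytes on disk never change; only the decoder's guess about the charset does.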
« Last Edit: November 12, 2007, 10:05:02 PM by perkiset »
nop_90
« Reply #6 on: November 12, 2007, 10:50:45 PM »

Not sure what the correct term is.

For the browser, it is set by this tag:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
You can change the charset to different encodings.

So on this board it is set to UTF-8 right now,
but if you scrape and use a different encoding it will be garbled.

Perl has different codecs you use to translate between the encodings:
http://perldoc.perl.org/utf8.html
So does Python:
http://evanjones.ca/python-utf8.html

Where the problem happens is when a char is not valid for the encoding you have picked; then the codec throws an error.
This happens when you do stuff like scrape a Russian site (the charset will be set to whatever Russian sites use) while the text of the site is actually in English.
So now you have English with a Russian charset.
Real pain in the ass.
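
Something along these lines might do the same job in PHP. It is only a sketch: both function names are hypothetical, the Windows-1252 fallback is an arbitrary assumption, and it needs the mbstring extension. The idea is to trust the bytes over the label when they are already valid UTF-8 (which covers plain English text on a page labelled with another charset), and otherwise convert from whatever the meta tag declares.

```php
<?php
// Pull the charset out of a <meta ... charset=...> tag, if there is one.
function declared_charset(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Return the page as UTF-8. $fallback is used when no usable charset is declared.
function scrape_to_utf8(string $html, string $fallback = 'Windows-1252'): string
{
    // Bytes that are already valid UTF-8 (pure ASCII included) are left alone.
    // This is a heuristic, not a guarantee.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    $charset = declared_charset($html) ?? $fallback;

    // Only pass names mbstring lists as supported; aliases it does not list
    // fall back too, which is a limitation of this sketch.
    $known = array_map('strtolower', mb_list_encodings());
    if (!in_array(strtolower($charset), $known, true)) {
        $charset = $fallback;
    }

    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// Russian text in Windows-1251 bytes, labelled correctly by the page.
$page = '<meta http-equiv="content-type" content="text/html; charset=windows-1251">'
      . "\xCF\xF0\xE8\xE2\xE5\xF2";
echo scrape_to_utf8($page), "\n";
```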
nutballs
« Reply #7 on: November 13, 2007, 09:35:39 AM »

AHHHHHH duh!

My test page has no complete HTML page structure; it's just straight output.
So as a result: no DOCTYPE, no head, no body, no charset, nothing. I'm guessing you do...

That's why it's displaying garbled. D'oh!
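
For a bare test page, a minimal fix could look like the sketch below. The helper name and the page title are made up; the meta tag is the one quoted earlier in the thread, and the header() call only matters when the script is served by a web server.

```php
<?php
// Wrap raw scraped output in a minimal page that declares its charset.
function wrap_as_utf8_page(string $body, string $title = 'scrape test'): string
{
    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">' . "\n"
        . '<title>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</title>\n"
        . "</head>\n<body>\n" . $body . "\n</body>\n</html>\n";
}

// Also send the charset as an HTTP header when running under a web server.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo wrap_as_utf8_page("It\xE2\x80\x99s working");
```

With either the header or the meta tag in place, the browser no longer has to guess the encoding of the straight output.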

<jedi>You don't need to read this thread, move along</jedi>