
nutballs
So in PHP, how do I deal with non-ASCII characters that are not written as HTML entity references? Examples: “ ” ’. When you rip them, they come out as ’ for example. I am using perk's webrequest class to grab the stuff, and then my own giant list of regexes to strip out all the garbage. However, when I display the final cleaned-up strings, those characters are funky. Thoughts?

nop_90
You can either use the regexp on the board to eliminate all non-ASCII chars, or you can mess around with UTF-8 encodings.

perkiset
I am curious about the encodings as well. I just looked at the database and the interesting characters are even stored as “ ” ’ and are viewable / editable in phpMyAdmin, so clearly there's a reasonably easy way to deal with them... I've just never done it.
If I get a moment I'll see if I can ferret out where that is happening in the SMF codebase...

nutballs
Yeah, it's a character encoding issue. It must be PHP that is doing it during a string conversion, or the regex is doing it. I will see if I can figure out which step in the process causes it.

nutballs
It's the web request class that's doing it, apparently. Must be the encoding, any thoughts?
This results in what I am talking about:

    $url = 'http://www.perkiset.org/politics/2007/11/11/where-have-all-the-hippies-gone/';
    $req = new WebRequest2();
    echo $req->simpleGet($url);

perkiset
I think it's actually fine, even though it looks weird in whatever you're looking at it with. The funny characters appear correctly in a browser, even if they appear really weird in the console. They store correctly on the disk and in a DB.
I just tested by doing this:

    <?php
    // Assumed: same request class as in the snippet above.
    $req = new WebRequest2();
    $req->domain = 'www.perkiset.org';
    $req->url = '/politics/2007/11/11/where-have-all-the-hippies-gone/';
    $req->dispatch();
    // Show the page and keep a copy on disk for a second look.
    echo $req->getContent();
    file_put_contents('/www/sites/utility/lastget.html', $req->getContent());
    ?>

When I do this calling it from a browser, the browser displays the characters perfectly. Then if I call for lastget.html the characters are also perfect. I think this is an issue of looking at the UTF-8 from something other than a browser. Even the code you posted above works perfectly for me when called from a browser... (only tested in FF and Safari, but that's probably good enough...)

nop_90
Not sure what the correct term is.
For the browser, it is set by this tag: <meta http-equiv="content-type" content="text/html; charset=UTF-8">
You can change the charset to different encodings. So this board is set to UTF-8 right now, but if you scrape it and use a different encoding, it will come out all messed up.
Perl has different codecs you use to translate between the encodings: http://perldoc.perl.org/utf8.html
So does Python: http://evanjones.ca/python-utf8.html
Where the problem happens is when a char is not valid for the encoding you have picked; then the codec throws an error. This happens when you scrape, say, a Russian site (the charset will be set to whatever Russian uses) while the text of the site is actually in English. So now you have English with a Russian charset. Real pain in the ass.

nutballs
AHHHHHH duh!
On my test page I have no complete HTML page structure, it's just straight output. So as a result: no DOCTYPE, no head, no body, no character set, nothing. I'm guessing you do... that's why it's displaying fished up.
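
For anyone landing here later, a rough sketch of the two routes the thread talks about: telling the browser the output is UTF-8 (the fix the thread ends on), and the blunt "strip everything non-ASCII" option. This assumes the scraped string really is UTF-8. `wrapInUtf8Page()` and `stripNonAscii()` are made-up helper names for illustration, not part of perk's class.

```php
<?php
// Sketch only. wrapInUtf8Page() and stripNonAscii() are hypothetical helper
// names for illustration; they are not part of perk's WebRequest2 class.

/**
 * Wrap already-UTF-8 text in a minimal page that declares its charset,
 * so the browser does not have to guess the encoding.
 */
function wrapInUtf8Page(string $utf8Text): string
{
    // Escape &, <, > and quotes; multibyte characters pass through untouched.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . '<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
        . '</head><body><p>' . $safe . '</p></body></html>';
}

/**
 * The blunt alternative: drop every byte outside the ASCII range.
 * Without the /u modifier the pattern works byte by byte, so each
 * multibyte UTF-8 character is removed entirely.
 */
function stripNonAscii(string $text): string
{
    return preg_replace('/[^\x00-\x7F]/', '', $text);
}

$scraped = "It\u{2019}s \u{201C}fine\u{201D}"; // curly apostrophe and quotes

// Over HTTP, this header does the same job as the meta tag.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo wrapInUtf8Page($scraped), "\n";
echo stripNonAscii($scraped), "\n";
```

The first route keeps the curly quotes and just makes sure they are labelled correctly. The second throws them away, so the sample string ends up as plain "Its fine", which may or may not be what you want from a scrape.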
