
nutballs
PHP is making my brain hurt lately. Why does this find a match in café?

    preg_match('#©#is','café')

I am assuming that the copyright symbol is just matching any non-ASCII character, since it also does it for any sentence that has non-ASCII in it. So what magic switch do I need to flip to make this work right?
DangerMouse
I'm useless at regex (lol, so maybe I shouldn't be posting this) but I stumbled across this on one of the resource pages I have saved, basically saying that you can search for the Unicode character. Quote:
"\un" - Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
Source: http://www.macronimous.com/resources/writing_regular_expression_with_php.asp
DM
nutballs
Yeah, I should have been more clear. It does it for the \u method as well as \x, which is hex. Neither matches correctly.
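For reference, the switch PCRE offers for this is the `u` pattern modifier, which makes the pattern and the subject be read as UTF-8 characters rather than raw bytes. A minimal sketch, assuming the scraped text really is UTF-8; the helper name `has_copyright` is made up for illustration and the test strings are written as explicit bytes so the encoding of the script file does not matter:

```php
<?php
// Hypothetical helper: true if $text (assumed to be UTF-8) contains U+00A9.
function has_copyright($text)
{
    // With /u, \x{A9} means the code point U+00A9, not the single byte 0xA9.
    // preg_match() returns false (not 0) when $text is not valid UTF-8,
    // so the strict comparison treats broken input as "no match".
    return preg_match('/\x{A9}/u', $text) === 1;
}

var_dump(has_copyright("caf\xC3\xA9"));        // "café" as UTF-8 bytes
var_dump(has_copyright("\xC2\xA9 2008 Acme")); // "© 2008 Acme" as UTF-8 bytes
```

With these inputs the first call should report false and the second true. If the input is in some other charset, it has to be converted to UTF-8 first, otherwise the `u` modifier rejects it.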
perkiset
Don't know the answer to that one, but this might be a wide-char issue where the MSBs are the same in both the café and the copyright symbol... I dunno, man. Wish I had a few moments free to work that one with you... looks like a solid puzzle.
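One way to test the wide-char theory is to dump the raw bytes. A sketch, assuming the scraped text is UTF-8 while the © typed into the script was saved as a single ISO-8859-1 byte (that mismatch is a guess, not something confirmed in this thread); `hex_bytes` is a made-up helper name:

```php
<?php
// Hypothetical helper: space-separated hex dump of a byte string.
function hex_bytes($s)
{
    return trim(chunk_split(bin2hex($s), 2, ' '));
}

$cafe_utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$copy_utf8   = "\xC2\xA9";    // "©" encoded as UTF-8
$copy_latin1 = "\xA9";        // "©" encoded as ISO-8859-1 (assumed)

echo hex_bytes($cafe_utf8), "\n"; // 63 61 66 c3 a9
echo hex_bytes($copy_utf8), "\n"; // c2 a9

// Byte-wise (no /u): the lone 0xA9 matches the trailing byte of "é".
var_dump(preg_match("/\xA9/", $cafe_utf8)); // int(1)
var_dump(strpos($cafe_utf8, $copy_latin1)); // int(4)

// The full two-byte UTF-8 sequence for © does not occur in "café".
var_dump(strpos($cafe_utf8, $copy_utf8));   // bool(false)
```

Under that assumption both `preg_match()` and `strpos()` report a hit in café, because é and © share their second byte in UTF-8. It would not by itself explain hits in plain-ASCII strings.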
DangerMouse
[quote author=nutballs link=topic=790.msg5456#msg5456 date=1203719179]yea i should have been more clear. it does it for the u method as well as the x which is hex. neither match correctly.[/quote]
Nah, I should have read the question correctly! Sorry, long day.
nutballs
Interestingly, the copyright symbol also matches that text if I use strpos(). Weird.
Other strings that match are: [c’est top] [いいですね] [das ist gut] [esto es genial] [ciò è buono] [isto é bom] [هذا هو الحكم] husband – the taste of which (that's an em-dash in there)
Bompa
Both of these print "Found":

    $text = '©';

    if ($text =~ /©/) {
        print "Found\n";
    } else {
        print "not found\n";
    }

    if ($text =~ /\xA9/) {
        print "Found\n";
    } else {
        print "not found\n";
    }

To insert the copyright symbol into $text and into the first regex, I had to hold down ALT and press 0169. Sorry for the Perl.
Bomps
nutballs
There is no problem matching the ©.
The problem is that the regex matches every single other extended character as well. It's an issue with PHP5, and will be solved in 6. But currently I am trying to figure out a workaround.
Bompa
Yah, I just noticed that my test was incomplete, so I modified it and it still works, so I believe you when you say it's a PHP issue.
thedarkness
    echo -n © | od    = 000251   // Octal 251
    echo -n © | od -x = 00a9     // Hex a9

This looks like iso_8859-15, but it could be screwed by the way it's been represented on the page. If you do an "od" of the original, nuts, what do you get? Is it a file? If it is, you may be able to convert it using iconv.
HTH,
td
nutballs
OK, I made it part of the way. I created a converter function to convert the multibyte chars into their standard equivalents. This method also works for testing the copyright symbol and registration mark.
I'm still stuck though. This code page charset stupidity of the internets is driving me bonkers. Is there a way to convert a string from whatever charset it is into UTF-8?
These are all coming from live scrapes btw, so to the question about coming from a file, the answer is nope, it comes from the tubes and is used for plugging up the tubes with my turds.

    function convertchars($string) {
        // UTF-8 byte sequences of common typographic characters
        $search = array(
            chr(0xe2) . chr(0x80) . chr(0x98), // left single quote
            chr(0xe2) . chr(0x80) . chr(0x99), // right single quote
            chr(0xe2) . chr(0x80) . chr(0x9c), // left double quote
            chr(0xe2) . chr(0x80) . chr(0x9d), // right double quote
            chr(0xe2) . chr(0x80) . chr(0x93), // en dash
            chr(0xe2) . chr(0x80) . chr(0x94), // em dash
            chr(0xe2) . chr(0x80) . chr(0xa6), // ellipsis
            chr(0xc2) . chr(0xab),             // left guillemet
            chr(0xc2) . chr(0xbb),             // right guillemet
            chr(0xc2) . chr(0xb4)              // acute accent
        );
        $replace = array("'", "'", '"', '"', '-', '-', '...', '<<', '>>', "'");
        return str_replace($search, $replace, $string);
    }

    // !== false (rather than > 0) so a match at position 0 is not missed
    if (strpos($s, chr(0xc2) . chr(0xa9)) !== false) {
        $matched = true; // copyright symbol
        $err .= 'CopyrightSymbol:';
    }
    if (strpos($s, chr(0xc2) . chr(0xae)) !== false) {
        $matched = true; // register mark
        $err .= 'RegisterMark:';
    }

thedarkness
Look at iconv, nuts, although that only works on files I'm afraid. You should be able to get what the webserver "thinks" the file is from the server header, maybe that would help?
This is what I ended up doing last time I was in a similar situation: I just blitzed everything that wasn't a "standard" char. BTW, I did this a long time ago and, just looking at it now, it doesn't look the best :-)

    // filesanitizer: remove unwanted chars from a csv file
    // compile with:
    //   g++ -O -o filesanitizer filesanitizer.cpp
    //
    #include <cctype>
    #include <cstdio>
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <unistd.h>

    using namespace std;

    void usage( char* exename )
    {
        cout << endl;
        cout << "Usage: " << exename << " targetfile" << endl << endl;
        cout << "Targetfile being the file you wish to convert." << endl << endl;
    }

    int main( int argc, char** argv )
    {
        if( argc != 2 )
        {
            usage( argv[0] );
            return 1;
        }

        string ifilename = argv[1];
        string ofilename = ifilename + ".tmp";

        ifstream infile( ifilename.c_str() );
        ofstream outfile( ofilename.c_str() );

        string line;

        while( getline( infile, line ) )
        {
            // Split at the last ",http://" (or ", http://") so the URL column is left untouched.
            size_t pos = line.rfind( ",http://" );
            if( pos == string::npos )
                pos = line.rfind( ", http://" );
            if( pos == string::npos )
                pos = line.length(); // no URL column: sanitize the whole line

            string firstpart_ori = line.substr( 0, pos );
            string firstpart_new = "\"";
            string lastpart = line.substr( pos );

            // Keep only alphanumerics and whitespace in the first part.
            for( size_t i = 0; i < firstpart_ori.length(); i++ )
            {
                // unsigned char: isalnum()/isspace() are undefined for negative values
                unsigned char c = firstpart_ori.at( i );
                if( isalnum( c ) || isspace( c ) )
                    firstpart_new += c;
            }

            outfile << firstpart_new + "\"" + lastpart << endl;
        }

        infile.close();
        outfile.close();

        // Replace the original file with the sanitized copy.
        unlink( ifilename.c_str() );
        rename( ofilename.c_str(), ifilename.c_str() );

        return 0;
    }

Cheers,
td
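On the question of getting a scraped string into UTF-8 without going through a file: PHP's own `iconv()` and `mb_convert_encoding()` functions work on strings. A rough sketch, assuming the mbstring extension is loaded and that anything that is not already valid UTF-8 is ISO-8859-1. That fallback is an assumption for illustration; the charset from the server's Content-Type header, as suggested above, would be a better source. `to_utf8` is a made-up name:

```php
<?php
// Hypothetical helper: return $s as UTF-8, converting from $fallback
// only when $s is not already a valid UTF-8 byte sequence.
function to_utf8($s, $fallback = 'ISO-8859-1')
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

$scraped = "caf\xE9 \xA9 2008"; // "café © 2008" as ISO-8859-1 bytes
$utf8    = to_utf8($scraped);

echo bin2hex($utf8), "\n";
// Once the text is UTF-8, the /u modifier can be used for matching.
var_dump(preg_match('/\x{A9}/u', $utf8));
```

The validity check is only a heuristic: some ISO-8859-1 byte runs also happen to be valid UTF-8, so a declared charset should win over guessing whenever one is available.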
