nutballs

php is making my brain hurt lately.

why does this find a match in café?

preg_match('#©#is','café')

I am assuming that the copyright symbol is just matching any non-ascii character, since it also does it for any sentence that has non-ascii in it.
so what magic switch do I need to flip to make this work right?

DangerMouse

I'm useless at regex (lol so maybe I shouldn't be posting this) but I stumbled across this on one of the resource pages I have saved, basically saying that you can search for the unicode character:

quote
"\un" - Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).

Source: http://www.macronimous.com/resources/writing_regular_expression_with_php.asp


DM

nutballs

yea i should have been more clear. it does it for the \u method as well as \x, which is hex. neither matches correctly.
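For reference, PHP's PCRE has no \u escape; a code point is written as \x{...} together with the u (UTF-8) pattern modifier. The sketch below is only an illustration of the difference between byte matching and code-point matching, assuming the subject string is UTF-8. It may or may not be the cause of the false matches described in this thread.

```php
<?php
// Assumption: the subject strings are UTF-8 encoded.
$subject = "caf\xC3\xA9";   // "café" as UTF-8 bytes (é = C3 A9)

// Byte-oriented match: a lone 0xA9 byte is found inside the UTF-8 "é".
$byteMatch = preg_match('/\xA9/', $subject);

// Code-point match: with the u modifier, \x{A9} means U+00A9 only.
$charMatch = preg_match('/\x{A9}/u', $subject);

// A real copyright symbol (UTF-8 bytes C2 A9) still matches.
$copyright = preg_match('/\x{A9}/u', "\xC2\xA9 2008");

var_dump($byteMatch, $charMatch, $copyright);
```

The u modifier requires both the pattern and the subject to be valid UTF-8, so scraped text in another charset would need converting first.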

perkiset

don't know the answer to that one, but this might be a wide-char issue and the MSBs are the same in both the cafe and copyright symbol... i dunno man. Wish I had a few moments free to work that one with you... looks like a solid puzzle.

DangerMouse

quote (nutballs)

yea i should have been more clear. it does it for the u method as well as the x which is hex. neither match correctly.


Nah I should have read the question correctly! Sorry, long day.

nutballs

interestingly the copyright symbol also matches that text if i use strpos(). weird.
other strings that match are:
[c’est top] [いいですね] [das ist gut] [esto es genial] [ciò è buono] [isto é bom] [هذا هو الحكم]
husband – the taste of which  (that's an em-dash in there)



Bompa

Both of these print "found":


$text = '©';

if($text =~ /©/) {
  print "Found ";
}else{
  print "not found ";
}

if($text =~ /\xA9/) {
  print "Found ";
}else{
  print "not found ";
}


To insert the copyright symbol into $text and into the first regex, I had to hold down ALT and press 0169.

sorry for the perl.

Bomps

nutballs

there is no problem matching the ©
the problem is that regex matches every single other extended character as well.

It's an issue with PHP 5, and will be solved in 6. But currently I am trying to figure out a workaround.

Bompa

Yah, i just noticed that my test was incomplete, so i modified it and it still works so i believe you when you say it's a php issue.

thedarkness

echo -n ©|od            = 000251  // Octal 251
echo -n ©|od -x        = 00a9    //  Hex a9

This looks like iso_8859-15 but it could be screwed by the way it's been represented on the page.

If you do an "od" of the original, nuts, what do you get?

Is it a file? If it is you may be able to convert it using iconv.
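For strings that never touch a file, a rough PHP stand-in for "od -x" is to dump the bytes with bin2hex(). This is only a sketch; hex_bytes is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper: returns the bytes of $s as space-separated hex pairs.
function hex_bytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hex_bytes("\xC2\xA9"), "\n";     // c2 a9  (the copyright symbol as UTF-8)
echo hex_bytes("\xA9"), "\n";         // a9     (the same symbol as ISO-8859-1/15)
echo hex_bytes("caf\xC3\xA9"), "\n";  // 63 61 66 c3 a9
```

Running it on the scraped text and on the search pattern should show whether both sides are in the same encoding.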

HTH,
td

nutballs

ok i made it part of the way. I created a converter function to convert the multibyte chars into their standard equivs.
This method also works for testing the copyright symbol and registration mark.
I'm still stuck though. this code page charset stupidity of the internets is driving me bonkers. Is there a way to convert a string from whatever charset it is, into UTF8?

these are all coming from live scrapes btw, so the question about coming from a file, the answer is nope, it comes from the tubes and is used for plugging up the tubes with my turds.


// Replace common multibyte (UTF-8) punctuation with plain ASCII equivalents.
function convertchars($string)
{
$search = array(chr(0xe2) . chr(0x80) . chr(0x98), // left single quote
chr(0xe2) . chr(0x80) . chr(0x99), // right single quote
chr(0xe2) . chr(0x80) . chr(0x9c), // left double quote
chr(0xe2) . chr(0x80) . chr(0x9d), // right double quote
chr(0xe2) . chr(0x80) . chr(0x93), // en dash
chr(0xe2) . chr(0x80) . chr(0x94), // em dash
chr(0xe2) . chr(0x80) . chr(0xa6), // ellipsis
chr(0xc2) . chr(0xab), // left guillemet
chr(0xc2) . chr(0xbb), // right guillemet
chr(0xc2) . chr(0xb4)); // acute accent

$replace = array("'",
"'",
'"',
'"',
'-',
'-',
'...',
'<<',
'>>',
"'");

return str_replace($search, $replace, $string);
}




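// strpos() returns 0 for a match at the very start, so compare against false.
if (strpos($s, chr(0xc2) . chr(0xa9)) !== false)
{
$matched = true; // copyright symbol (UTF-8 bytes C2 A9)
$err .= 'CopyrightSymbol:';
}
if (strpos($s, chr(0xc2) . chr(0xae)) !== false)
{
$matched = true; // registered mark (UTF-8 bytes C2 AE)
$err .= 'RegisterMark:';
}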

thedarkness

look at iconv, nuts, although the command-line tool only works on files I'm afraid. You should be able to get what the webserver "thinks" the file is from the server header, maybe that would help?
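PHP also has an iconv() function that works on strings, so the server-header idea can be sketched directly on scraped text. The following is a minimal sketch, not a robust detector: to_utf8 is a hypothetical helper name, the ISO-8859-1 fallback is a guess (Windows-1252 is also common), and it assumes the iconv and mbstring extensions are loaded.

```php
<?php
// Hypothetical helper: convert $body to UTF-8 using the charset named in a
// Content-Type header value, e.g. "text/html; charset=ISO-8859-1".
function to_utf8($body, $contentType = '')
{
    $charset = '';
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        $charset = strtoupper($m[1]);
    }

    // No declared charset: keep the bytes if they are already valid UTF-8,
    // otherwise fall back to ISO-8859-1 (an assumption, not a detection).
    if ($charset === '') {
        if (mb_check_encoding($body, 'UTF-8')) {
            return $body;
        }
        $charset = 'ISO-8859-1';
    }

    if ($charset === 'UTF-8') {
        return $body;
    }

    // iconv() returns false on failure; hand back the original bytes then.
    $converted = iconv($charset, 'UTF-8', $body);
    return ($converted === false) ? $body : $converted;
}

$latin1 = "caf" . chr(0xE9);   // "café" in ISO-8859-1
echo bin2hex(to_utf8($latin1, 'text/html; charset=ISO-8859-1')), "\n"; // 636166c3a9
```

Servers do not always declare the charset correctly, so the header is a hint rather than a guarantee.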

This is what I ended up doing last time I was in a similar situation, I just blitzed everything that wasn't a "standard" char. BTW, I did this a long time ago and, just looking at it now it doesn't look the best :-)


// filesanitizer: remove unwanted chars from a csv file
// compile with:
//    g++ -O -o filesanitizer filesanitizer.cpp
//

#include <cctype>   // isalnum, isspace
#include <cstdio>   // rename
#include <iostream>
#include <fstream>
#include <string>
#include <unistd.h> // unlink

using namespace std;

void usage( char* exename )
{
  cout << endl;
  cout << "Usage: " << exename << " targetfile" << endl << endl;
  cout << "Targetfile being the file you wish to convert." << endl << endl;

}

int main ( int argc, char** argv )
{
  if( argc != 2 )
  {
    usage( argv[0] );
    return 1;
  }

  string ifilename = argv[1];
  string ofilename = ifilename + ".tmp";
  ifstream infile( ifilename.c_str() );
  ofstream outfile( ofilename.c_str() );
  string line;
  char c;

  while( getline( infile, line ) )
  {
    //cout << line;

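    // Split at the last ",http://" (or ", http://"); lines without a URL
    // are written through unchanged instead of throwing in substr().
    size_t pos = line.rfind( ",http://" );
    if( pos == string::npos )
      pos = line.rfind( ", http://" );
    if( pos == string::npos )
    {
      outfile << line << endl;
      continue;
    }
    string firstpart_ori = line.substr( 0, pos );
    string firstpart_new = "\"";
    string lastpart = line.substr( pos );
    for( pos = 0; pos < firstpart_ori.length(); pos++ )
    {
      c = firstpart_ori.at( pos );
      // Cast first: passing a negative char to isalnum/isspace is undefined.
      if( isalnum( (unsigned char)c ) || isspace( (unsigned char)c ) )
        firstpart_new += c;
    }

    string newline = firstpart_new + "\"" + lastpart;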
    outfile << newline << endl;
    //outfile.putline( newline );
  }

    infile.close();
    outfile.close();
    unlink( ifilename.c_str() );
    rename( ofilename.c_str(), ifilename.c_str() );

    return 0;
}



Cheers,
td

