The Cache: Technology Expert's Forum
 
Topic: regex copyright symbol
nutballs
« on: February 22, 2008, 02:36:20 PM »

PHP is making my brain hurt lately.

Why does this find a match in "café"?

preg_match('#©#is','café')

I am assuming that the copyright symbol is just matching any non-ASCII character, since it also matches any sentence that has non-ASCII characters in it.
So what magic switch do I need to flip to make this work right?

I could eat a bowl of Alphabet Soup and shit a better argument than that.
DangerMouse
« Reply #1 on: February 22, 2008, 03:12:17 PM »

I'm useless at regex (lol, so maybe I shouldn't be posting this), but I stumbled across this on one of the resource pages I have saved. It basically says that you can search for the Unicode character:

Quote
"\un" - Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
Source: http://www.macronimous.com/resources/writing_regular_expression_with_php.asp

DM
nutballs
« Reply #2 on: February 22, 2008, 03:26:19 PM »

yea i should have been more clear. it does it for the \u method as well as the \x which is hex. neither match correctly.
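
For reference, here is a rough sketch of how the two escape forms behave in PHP's PCRE functions. Without the `u` modifier, `\xA9` is a single byte. With `u`, pattern and subject are treated as UTF-8 and `\x{A9}` means the code point U+00A9. PCRE has no `\u00A9` escape, so the form quoted in reply #1 is not valid in `preg_match`. The sample strings are made up for illustration and assume the text is UTF-8.

```php
<?php
$cafe  = "caf\xC3\xA9";            // "café" as UTF-8 bytes (assumed encoding)
$title = "\xC2\xA9 2008 Example";  // "© 2008 Example" as UTF-8 bytes

// Byte mode: \xA9 matches the trailing byte of é as well as that of ©.
$byte_cafe  = preg_match('/\xA9/', $cafe);
$byte_title = preg_match('/\xA9/', $title);

// UTF-8 mode: \x{A9} matches only the code point U+00A9.
$utf_cafe  = preg_match('/\x{A9}/u', $cafe);
$utf_title = preg_match('/\x{A9}/u', $title);

var_dump($byte_cafe, $byte_title, $utf_cafe, $utf_title); // int(1) int(1) int(0) int(1)
```

With `u`, a subject that is not valid UTF-8 makes `preg_match` return false, so scraped text in other charsets has to be converted first.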
perkiset
« Reply #3 on: February 22, 2008, 03:56:11 PM »

don't know the answer to that one, but this might be a wide-char issue and the MSBs are the same in both the cafe and copyright symbol... i dunno man. Wish I had a few moments free to work that one with you... looks like a solid puzzle.
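
The shared-byte hunch can be checked by dumping the raw bytes. A minimal sketch, assuming the scraped text is UTF-8 while the pattern was typed into a script saved as ISO-8859-1 (the thread never states either encoding, so both are assumptions):

```php
<?php
// Assumption: the scraped text is UTF-8, the script file is ISO-8859-1.
$cafe_utf8   = "caf\xC3\xA9"; // "café" in UTF-8: é is the two bytes C3 A9
$copy_latin1 = "\xA9";        // "©" in ISO-8859-1: the single byte A9
$copy_utf8   = "\xC2\xA9";    // "©" in UTF-8: the two bytes C2 A9

echo bin2hex($cafe_utf8), "\n";   // 636166c3a9
echo bin2hex($copy_latin1), "\n"; // a9
echo bin2hex($copy_utf8), "\n";   // c2a9

// Without the u modifier PCRE compares bytes, so the lone A9 byte
// is found inside the UTF-8 encoding of é.
$byte_match = preg_match('#' . $copy_latin1 . '#is', $cafe_utf8);

// The UTF-8 form of © (C2 A9) does not occur anywhere in "café".
$utf8_match = preg_match('#' . $copy_utf8 . '#is', $cafe_utf8);

var_dump($byte_match); // int(1)
var_dump($utf8_match); // int(0)
```

Under these assumptions the low byte A9 is what é and © have in common, which would explain why any text with accented characters can trigger a hit.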

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
DangerMouse
« Reply #4 on: February 22, 2008, 04:12:09 PM »

Quote
yea i should have been more clear. it does it for the \u method as well as the \x which is hex. neither match correctly.

Nah I should have read the question correctly! Sorry, long day.
nutballs
« Reply #5 on: February 22, 2008, 04:41:41 PM »

Interestingly, the copyright symbol also matches that text if I use strpos(). Weird.
Other strings that match are:
[cest top] [いいですね] [das ist gut] [esto es genial] [ci buono] [isto bom] [هذا هو الحكم]
husband the taste of which  (that's an em-dash in there)
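
The strpos() result is consistent with a byte-level match: strpos() compares raw bytes and knows nothing about characters. A small sketch, assuming UTF-8 text and a needle that holds only the single byte A9 (both are assumptions about the setup):

```php
<?php
$cafe = "caf\xC3\xA9"; // "café" as UTF-8 bytes

// A needle holding only the byte A9 (© in ISO-8859-1) is found at byte offset 4.
$latin1_hit = strpos($cafe, "\xA9");

// A needle holding the full UTF-8 sequence for © (C2 A9) is not found.
$utf8_hit = strpos($cafe, "\xC2\xA9");

var_dump($latin1_hit); // int(4)
var_dump($utf8_hit);   // bool(false)
```

If that is what is happening, making the needle and the haystack use the same encoding should stop the false hits.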



Bompa
« Reply #6 on: February 23, 2008, 09:55:40 PM »

Both of these print "found":


$text = '©';

if($text =~ /©/) {
  print "Found\n";
}else{
  print "not found\n";
}

if($text =~ /\xA9/) {
  print "Found\n";
}else{
  print "not found\n";
}


To insert the copyright symbol into $text and into the first regex, I had to hold down ALT and press 0169.

sorry for the perl.

Bomps
« Last Edit: February 23, 2008, 09:58:20 PM by Bompa »

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein
nutballs
« Reply #7 on: February 23, 2008, 10:38:29 PM »

There is no problem matching the ©.
The problem is that the regex matches every single other extended character as well.

It's an issue with PHP5, and will be solved in 6. But currently I am trying to figure out a workaround.
Bompa
« Reply #8 on: February 23, 2008, 10:49:18 PM »

Yah, I just noticed that my test was incomplete, so I modified it and it still works, so I believe you when you say it's a PHP issue.

thedarkness
« Reply #9 on: February 24, 2008, 05:55:07 PM »

echo -n © | od       = 000251  // Octal 251
echo -n © | od -x    = 00a9    // Hex a9

This looks like ISO-8859-15, but it could be skewed by the way it's been represented on the page.

If you do an "od" of the original, nuts, what do you get?

Is it a file? If it is you may be able to convert it using iconv.

HTH,
td

"I want to be the guy my dog thinks I am."
 - Unknown
nutballs
« Reply #10 on: February 25, 2008, 08:23:48 AM »

OK, I made it part of the way. I created a converter function to convert the multibyte chars into their plain ASCII equivalents.
The same byte-sequence approach also works for testing for the copyright symbol and registration mark.
I'm still stuck though. This code page charset stupidity of the internets is driving me bonkers. Is there a way to convert a string from whatever charset it is in into UTF-8?

These are all coming from live scrapes btw, so to the question about coming from a file, the answer is nope: it comes from the tubes and is used for plugging up the tubes with my turds.

Code:
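function convertchars($string)
{
    // UTF-8 byte sequences for common typographic characters.
    $search = array(
        chr(0xe2) . chr(0x80) . chr(0x98), // left single quote
        chr(0xe2) . chr(0x80) . chr(0x99), // right single quote
        chr(0xe2) . chr(0x80) . chr(0x9c), // left double quote
        chr(0xe2) . chr(0x80) . chr(0x9d), // right double quote
        chr(0xe2) . chr(0x80) . chr(0x93), // en dash
        chr(0xe2) . chr(0x80) . chr(0x94), // em dash
        chr(0xe2) . chr(0x80) . chr(0xa6), // ellipsis
        chr(0xc2) . chr(0xab),             // left guillemet
        chr(0xc2) . chr(0xbb),             // right guillemet
        chr(0xc2) . chr(0xb4)              // acute accent
    );

    // Plain ASCII stand-ins, in the same order as $search.
    $replace = array(
        '\'',
        '\'',
        '"',
        '"',
        '-',
        '-',
        '...',
        '<<',
        '>>',
        '\''
    );

    return str_replace($search, $replace, $string);
}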

Code:

// Compare against false: "> 0" would miss a match at offset 0.
if (strpos($s, chr(0xc2) . chr(0xa9)) !== false)
{
    $matched = true; // copyright symbol
    $err .= 'CopyrightSymbol:';
}
if (strpos($s, chr(0xc2) . chr(0xae)) !== false)
{
    $matched = true; // registered mark
    $err .= 'RegisterMark:';
}
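
On the "convert from whatever charset into UTF-8" question, one possible approach is sketched below. It assumes the mbstring extension is available, and it assumes that anything which is not already valid UTF-8 is ISO-8859-1. That fallback is a guess, since a charset cannot be identified reliably from the bytes alone. `to_utf8` is a made-up helper name.

```php
<?php
// Hypothetical helper: returns $s as UTF-8.
// Assumption: anything that is not already valid UTF-8 is in $fallback.
function to_utf8($s, $fallback = 'ISO-8859-1')
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9 \xA9";  // "café ©" in ISO-8859-1
$utf8   = to_utf8($latin1);

echo bin2hex($utf8), "\n"; // 636166c3a920c2a9
```

Running the result through `to_utf8` a second time leaves it unchanged, so the helper is safe to apply to scrapes that are already UTF-8. When the server declares a charset, that value is a better source encoding than the fallback.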
thedarkness
« Reply #11 on: February 25, 2008, 03:20:11 PM »

Look at iconv, nuts. The command-line tool only works on files, I'm afraid, but PHP also has an iconv() function that converts strings. You should be able to get what the webserver "thinks" the charset is from the Content-Type response header, maybe that would help?
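
A rough sketch of the header idea, assuming PHP's iconv extension is loaded. The iconv command-line tool works on files, but the iconv() function converts a string in memory. The helper name `charset_from_content_type` and the sample header value are made up for illustration.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value.
// Returns null when the header does not declare one.
function charset_from_content_type($header)
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

$header  = 'text/html; charset=iso-8859-1'; // sample header value (assumed)
$body    = "caf\xE9";                       // "café" in ISO-8859-1
$charset = charset_from_content_type($header);

// iconv() converts the string directly; no file is involved.
$utf8 = ($charset !== null) ? iconv($charset, 'UTF-8', $body) : $body;

echo $charset, ' ', bin2hex($utf8), "\n"; // ISO-8859-1 636166c3a9
```

Servers do not always declare the charset correctly, so the converted text is only as trustworthy as the header.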

This is what I ended up doing last time I was in a similar situation, I just blitzed everything that wasn't a "standard" char. BTW, I did this a long time ago and, just looking at it now it doesn't look the best :-)

Code:
// filesanitizer: remove unwanted chars from a csv file
// compile with:
//    g++ -O -o filesanitizer filesanitizer.cpp
//

#include <cctype>    // isalnum, isspace
#include <cstdio>    // rename
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>  // unlink

using namespace std;

void usage( const char* exename )
{
  cout << endl;
  cout << "Usage: " << exename << " targetfile" << endl << endl;
  cout << "Targetfile being the file you wish to convert." << endl << endl;
}

int main( int argc, char** argv )
{
  if( argc != 2 )
  {
    usage( argv[0] );
    return 1;
  }

  string ifilename = argv[1];
  string ofilename = ifilename + ".tmp";
  ifstream infile( ifilename.c_str() );
  ofstream outfile( ofilename.c_str() );
  if( !infile || !outfile )
  {
    cerr << "Cannot open " << ifilename << " or " << ofilename << endl;
    return 1;
  }

  string line;
  while( getline( infile, line ) )
  {
    // The URL is the last field: look for ",http://" first, then ", http://".
    size_t pos = line.rfind( ",http://" );
    if( pos == string::npos )
      pos = line.rfind( ", http://" );
    if( pos == string::npos )
    {
      // No URL field: copy the line through instead of letting substr() throw.
      outfile << line << endl;
      continue;
    }

    string firstpart_ori = line.substr( 0, pos );
    string firstpart_new = "\"";
    string lastpart = line.substr( pos );
    for( size_t i = 0; i < firstpart_ori.length(); i++ )
    {
      // Cast to unsigned char: passing a negative char to isalnum() is undefined.
      unsigned char c = firstpart_ori[i];
      if( isalnum( c ) || isspace( c ) )
        firstpart_new += firstpart_ori[i];
    }

    outfile << firstpart_new << "\"" << lastpart << endl;
  }

  infile.close();
  outfile.close();
  unlink( ifilename.c_str() );
  rename( ofilename.c_str(), ifilename.c_str() );

  return 0;
}
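
Since the scrapes in this thread are strings in PHP rather than files, the same "blitz everything non-standard" idea can be sketched in PHP as well. This is not a port of the program above, only the character filter; `strip_nonstandard` is a made-up name, and the filter works on bytes, so every byte of a multibyte character is dropped.

```php
<?php
// Hypothetical helper: keep ASCII letters, digits and whitespace, drop the rest.
function strip_nonstandard($s)
{
    return preg_replace('/[^A-Za-z0-9\s]/', '', $s);
}

// UTF-8 sample containing é, ©, an em dash and an exclamation mark.
$dirty = "Caf\xC3\xA9 \xC2\xA9 2008 \xE2\x80\x94 ok!";
$clean = strip_nonstandard($dirty);

echo $clean, "\n"; // Caf  2008  ok
```

It is lossy by design: accented letters disappear rather than being mapped to their unaccented forms.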


Cheers,
td
« Last Edit: February 25, 2008, 03:27:15 PM by thedarkness »