The Cache: Technology Expert's Forum
 
Author Topic: N-gram regex help please  (Read 6073 times)
weeman212
« on: May 07, 2008, 03:37:30 PM »

Basically I wanted to zip through a block of text grabbing each three word phrase. I'll give an example, it will be easier.

string:

the quick brown fox jumps over the lazy dog

after running it through the regex, it should return:

the quick brown
quick brown fox
brown fox jumps
fox jumps over
jumps over the
over the lazy
the lazy dog

So my poor attempt at the regex came out like this:

preg_match_all('/\s(.+\s.+\s.+)/ims', $text, $matches);

I think the concept is correct, but I think the regex execution is wrong.

I would greatly appreciate any help.
nutballs
« Reply #1 on: May 07, 2008, 04:40:32 PM »

You're on the right track, though I can't test it.
(\w+\s\w+\s\w+) should be your captured part.
I don't think you need that leading \s either, the one you have outside the parens at the front.
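
For what it's worth, here is a quick sketch of what that pattern does on the example sentence from the first post. The variable names are made up for illustration. Because preg_match_all resumes searching after the end of each match, it should return back-to-back groups of three words rather than the overlapping ones that were asked for:

```php
<?php
// Illustrative only: $text and $m are made-up names.
$text = 'the quick brown fox jumps over the lazy dog';

// Each match consumes three words, so the next match starts at word four.
preg_match_all('/(\w+\s\w+\s\w+)/', $text, $m);

print_r($m[1]);
// Array
// (
//     [0] => the quick brown
//     [1] => fox jumps over
//     [2] => the lazy dog
// )
```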
perkiset
« Reply #2 on: May 07, 2008, 10:10:25 PM »

Here's what I would do. I decided to cleanse and normalize it first, so all non-word characters are right out and all whitespace is reduced to a single space. The problem I got was that many words were dropped if I simply used the finding pattern. With "we can not consecrate --", for example, I got "we can not" 3 times and the last word was dropped. So by cleaning it I got all the words.

Code:
<?php

$buff = <<<TEXT
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, 
and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, 
can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, 
as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and 
proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave 
men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The 
world will little note, nor long remember what we say here, but it can never forget what they did here. It is for 
us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so 
nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these 
honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- 
that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have 
a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish 
from the earth.
TEXT;

$buff = preg_replace(array('/\W/', '/\s+/'), ' ', $buff);
preg_match_all('/\w+\s+\w+\s+\w+/m', $buff, $parts);
print_r($parts);

?>



Array
(
    [0] => Array
        (
            [0] => Four score and
            [1] => seven years ago
            [2] => our fathers brought
            [3] => forth on this
            [4] => continent a new
            [5] => nation conceived in
            [6] => Liberty and dedicated
            [7] => to the proposition
            [8] => that all men
            [9] => are created equal
            [10] => Now we are
            [11] => engaged in a
            [12] => great civil war
            [13] => testing whether that
            [14] => nation or any
            [15] => nation so conceived
            [16] => and so dedicated
            [17] => can long endure
            [18] => We are met
            [19] => on a great
            [20] => battle field of
            [21] => that war We
            [22] => have come to
            [23] => dedicate a portion
            [24] => of that field
            [25] => as a final
            [26] => resting place for
            [27] => those who here
            [28] => gave their lives
            [29] => that that nation
            [30] => might live It
            [31] => is altogether fitting
            [32] => and proper that
            [33] => we should do
            [34] => this But in
            [35] => a larger sense
            [36] => we can not
            [37] => dedicate we can
            [38] => not consecrate we
            [39] => can not hallow
            [40] => this ground The
            [41] => brave men living
            [42] => and dead who
            [43] => struggled here have
            [44] => consecrated it far
            [45] => above our poor
            [46] => power to add
            [47] => or detract The
            [48] => world will little
            [49] => note nor long
            [50] => remember what we
            [51] => say here but
            [52] => it can never
            [53] => forget what they
            [54] => did here It
            [55] => is for us
            [56] => the living rather
            [57] => to be dedicated
            [58] => here to the
            [59] => unfinished work which
            [60] => they who fought
            [61] => here have thus
            [62] => far so nobly
            [63] => advanced It is
            [64] => rather for us
            [65] => to be here
            [66] => dedicated to the
            [67] => great task remaining
            [68] => before us that
            [69] => from these honored
            [70] => dead we take
            [71] => increased devotion to
            [72] => that cause for
            [73] => which they gave
            [74] => the last full
            [75] => measure of devotion
            [76] => that we here
            [77] => highly resolve that
            [78] => these dead shall
            [79] => not have died
            [80] => in vain that
            [81] => this nation under
            [82] => God shall have
            [83] => a new birth
            [84] => of freedom and
            [85] => that government of
            [86] => the people by
            [87] => the people for
            [88] => the people shall
            [89] => not perish from
        )
)
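
A minimal sketch of the dropped-word problem described in this post, run on a short sample instead of the full speech. The sample string and variable names are assumptions for illustration. Without the cleanup, no three-word match can start at "dedicate" because " -- " follows it. With the cleanup that word is picked up, although the matches still do not overlap, so the last two words are left over:

```php
<?php
// Illustrative only: a short made-up sample in the style of the speech.
$raw = 'we can not dedicate -- we can not consecrate';
$pattern = '/\w+\s+\w+\s+\w+/';

// Without cleanup: "dedicate" is followed by " -- ", so it is skipped.
preg_match_all($pattern, $raw, $before);

// With cleanup: non-word characters become spaces, then whitespace runs collapse.
$clean = preg_replace(array('/\W/', '/\s+/'), ' ', $raw);
preg_match_all($pattern, $clean, $after);

print_r($before[0]); // "we can not", "we can not"
print_r($after[0]);  // "we can not", "dedicate we can"
```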
« Last Edit: May 07, 2008, 10:19:09 PM by perkiset »
perkiset
« Reply #3 on: May 07, 2008, 10:25:32 PM »

However, looking back I notice that you want words 1-2-3, then 2-3-4, then 3-4-5 and so forth. I'm not good enough with REGEX to understand the backward movement features, so personally I'd redo it like this:

Code:
<?php

// Assume $buff from previous example

// trim() removes the trailing space the cleanup leaves behind, which would
// otherwise show up as an empty last word.
$words = explode(' ', trim(preg_replace(array('/\W/', '/\s+/'), ' ', $buff)));
$max = count($words) - 2;
$newList = array();
for ($i = 0; $i < $max; $i++)
    $newList[] = "{$words[$i]} {$words[$i+1]} {$words[$i+2]}";

print_r($newList);

?>



Array
(
    [0] => Four score and
    [1] => score and seven
    [2] => and seven years
    [3] => seven years ago
    [4] => years ago our
    [5] => ago our fathers
    [6] => our fathers brought
    [7] => fathers brought forth
    [8] => brought forth on
    [9] => forth on this
    [10] => on this continent
    [11] => this continent a
    [12] => continent a new
    [13] => a new nation

( clip )

    [255] => that government of
    [256] => government of the
    [257] => of the people
    [258] => the people by
    [259] => people by the
    [260] => by the people
    [261] => the people for
    [262] => people for the
    [263] => for the people
    [264] => the people shall
    [265] => people shall not
    [266] => shall not perish
    [267] => not perish from
    [268] => perish from the
    [269] => from the earth
)
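
The same sliding-window idea could be wrapped up for any n, not just 3. This is only a sketch: ngrams() is a hypothetical helper name, not something from the thread, and it splits on non-word characters directly instead of cleaning first:

```php
<?php
// Hypothetical helper: returns every run of $n consecutive words in $text.
function ngrams($text, $n = 3)
{
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $out = array();
    // Stop once fewer than $n words remain in the window.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $out[] = implode(' ', array_slice($words, $i, $n));
    }
    return $out;
}

print_r(ngrams('the quick brown fox jumps over the lazy dog'));
```

On the nine-word example sentence this should give the seven phrases listed in the first post, and an empty array for any text shorter than n words.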
vsloathe
« Reply #4 on: May 08, 2008, 06:35:52 AM »

#^(\w+\s+?\w+\s+?\w+)$#
perkiset
« Reply #5 on: May 08, 2008, 07:42:09 AM »

 Huh?

VS that doesn't work for me at all... "Collect 3 word-char blocks separated by 2 non-word blocks that must be the entire string?" Don't understand where you're going there...
nutballs
« Reply #6 on: May 08, 2008, 10:04:00 AM »

Also, the ? are unnecessary: \s+ only matches whitespace and has to be followed by \w, so lazy and greedy end up matching the same thing.
The ^ and $ are also restricting it to a string that is exactly 3 words long. lol
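
Both points can be checked with a small sketch (the test strings are made up for illustration):

```php
<?php
$anchored = '#^(\w+\s+?\w+\s+?\w+)$#';

// The whole string is exactly three words, so this matches.
$exact = preg_match($anchored, 'the quick brown');

// ^ and $ require the three words to be the entire string, so this does not.
$longer = preg_match($anchored, 'the quick brown fox');

// Lazy and greedy whitespace give the same match, even with extra spaces.
$text = 'the  quick   brown';
preg_match('/\w+\s+?\w+\s+?\w+/', $text, $lazy);
preg_match('/\w+\s+\w+\s+\w+/', $text, $greedy);

var_dump($exact);                   // int(1)
var_dump($longer);                  // int(0)
var_dump($lazy[0] === $greedy[0]);  // bool(true)
```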

I realized you would probably also want to strip all non-alphanumerics out first, but if not then you would need some extra "smarts" for dealing with commas as hard boundaries.

I guess if it was a big block of text:
I would first strip all "()[]{}
then I would split into chunks at every ,.;:?!
then I would regex each chunk to get the ngrams.
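
The three steps above could be sketched roughly like this. chunked_trigrams() is a hypothetical name, and for simplicity the last step uses a plain loop over each chunk's words rather than a regex:

```php
<?php
// Hypothetical helper: trigrams that never cross a punctuation boundary.
function chunked_trigrams($text)
{
    // 1. Strip quotes and brackets.
    $text = str_replace(array('"', '(', ')', '[', ']', '{', '}'), '', $text);

    // 2. Split into chunks at every , . ; : ? !
    $chunks = preg_split('/[,.;:?!]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 3. Collect the trigrams inside each chunk separately.
    $out = array();
    foreach ($chunks as $chunk) {
        $words = preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 0; $i + 3 <= count($words); $i++) {
            $out[] = $words[$i] . ' ' . $words[$i + 1] . ' ' . $words[$i + 2];
        }
    }
    return $out;
}

print_r(chunked_trigrams('We are met on a field, as a (final) resting place.'));
```

With this sample, "a field as" is never produced, because the comma ends the first chunk.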

That would be more accurate for what I am assuming you are doing this for.
Doing \w to capture word characters will freak out at non-word characters, so actually I would instead do \S (not space).
Also, you need to do a lookahead with a capture, so the following two words get captured without being consumed, which kinda makes it a bit screwy.

so....
(\S+)(?=(\s+\S+\s+\S+))

holy shit that looks dumb... lol

but regardless:
NUTBALLS FTW!!!!


Using perk's sample:
Code:
<?php
$buff = <<<TEXT
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, 
and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, 
can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, 
as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and 
proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave 
men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The 
world will little note, nor long remember what we say here, but it can never forget what they did here. It is for 
us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so 
nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these 
honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- 
that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have 
a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish 
from the earth.
TEXT;

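$buff = preg_replace(array('/\W/', '/\s+/'), ' ', $buff);
preg_match_all('/(\S+)(?=(\s+\S+\s+\S+))/i', $buff, $parts);
// Use < rather than <= so the loop stops at the last match.
for ($i = 0; $i < count($parts[1]); $i++)
{
    // Group 2 already begins with the separating whitespace.
    echo $parts[1][$i].$parts[2][$i].'<br>';
}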
?>
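
As a smaller check, here is a sketch of the same lookahead pattern run on the nine-word sentence from the first post. The variable names are made up for illustration:

```php
<?php
$text = 'the quick brown fox jumps over the lazy dog';

// Group 1 consumes one word. The lookahead captures the next two words in
// group 2 without consuming them, so the next match starts at the next word.
preg_match_all('/(\S+)(?=(\s+\S+\s+\S+))/', $text, $m);

$trigrams = array();
foreach ($m[1] as $i => $first) {
    // Group 2 already starts with the separating whitespace.
    $trigrams[] = $first . $m[2][$i];
}

print_r($trigrams);
```

This should print the seven overlapping phrases from the original question, from "the quick brown" through "the lazy dog".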

« Last Edit: May 08, 2008, 10:05:42 AM by nutballs »
weeman212
« Reply #7 on: May 08, 2008, 10:20:39 AM »

@ Perk: Thanks, seems to work great.

@NB: I'm actually stripping it down to alphanumeric, hyphens and apostrophes. Your code also works. Thanks.

I just need to clean up my text a bit better first, and looking at the sheer number of n-grams this is pulling up, I am going to need a bigger server :D
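
One possible way to do that kind of cleanup, keeping only alphanumerics, hyphens and apostrophes. This is a sketch under assumptions: the sample sentence and variable names are invented, and tokens with no letter or digit (such as a bare "--") are dropped. It also shows why the counts grow so quickly: a text of W words yields W - 3 + 1 overlapping trigrams.

```php
<?php
// Made-up sample text for illustration.
$raw = "We can't forget the battle-field -- not now, not ever!";

// Split on anything that is not a letter, digit, apostrophe or hyphen.
$tokens = preg_split("/[^A-Za-z0-9'-]+/", $raw, -1, PREG_SPLIT_NO_EMPTY);

// Drop tokens such as "--" that contain no letter or digit.
$words = array_values(array_filter($tokens, function ($t) {
    return preg_match('/[A-Za-z0-9]/', $t) === 1;
}));

print_r($words);

// Number of overlapping trigrams for this word list.
$trigramCount = max(0, count($words) - 3 + 1);
echo $trigramCount, "\n";
```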