Thread: regex stumper
nutballs

Normally I am a regex genius, but this one has me stumped (though it might be the painkillers I'm currently on).

I am trying to split sentences out from paragraphs, for obvious reasons if you know what I do with my spare time... But I am getting some bad splits. Acronyms are causing problems.

Take the following chunk of text:
A beginning sentence. This is the second satellite kill I have read, the first was with the U.S. Air Force launching a missile from a F-15 Eagle aircraft destroying another satellite reported to be no longer useful, and now the U.S. Navy has duplicated and killed a orbiting satellite from a ship. And ending sentence.

If I use the following regex to split on the periods, it splits on the periods you expect, plus the period at the end of "U.S." (There is a space after the ], btw, to make sure I match only ending periods.)
'#[a-z0-9][.] #is'

So this will match any sentence that ends in a letter or number followed by a period and a space. But it also matches the period after "U.S. ". How do I prevent it from matching ".S. " but still match "SSSS. "?
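To make the bad split concrete, here is a minimal sketch that runs the pattern above through preg_split. The sample text is a shortened version of the paragraph in the post (an assumption for brevity), and the variable names are illustrative only.

```php
<?php
// Shortened stand-in for the paragraph quoted above (assumption: only the
// parts relevant to the "U.S." problem are kept).
$paragraph = 'A beginning sentence. The U.S. Navy has killed a satellite from a ship. And ending sentence.';

// Original pattern: a letter or digit, then a period, then a space.
// Everything the pattern matches is consumed as the delimiter.
$pieces = preg_split('#[a-z0-9][.] #is', $paragraph);

print_r($pieces);
// The second piece comes out as "The U." because "S. " was treated as a
// sentence boundary.
```

Note that the last letter of each sentence is also swallowed by the delimiter, since the letter is part of the match.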

vsloathe

Put a negated character class for a period, [^.], in front of it.

E.g.

'#[^.][a-z0-9][.] #is'

takes care of any of that stuff...maybe?
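A quick way to check this is to list what the pattern matches with preg_match_all. This is only a sketch on a shortened sample sentence (an assumption, not the full paragraph from the first post):

```php
<?php
// Shortened sample text (assumption) containing the "U.S." acronym.
$paragraph = 'A beginning sentence. The U.S. Navy has killed a satellite from a ship. And ending sentence.';

// [^.] requires that the character two places before the period is not
// itself a period, which rules out the ".S. " in "U.S. ".
preg_match_all('#[^.][a-z0-9][.] #is', $paragraph, $matches);

print_r($matches[0]);
// Lists "ce. " and "ip. " only; the "U.S. " boundary is no longer matched.
```

It is not a complete fix: an abbreviation such as "Dr. " still fits the pattern, so it would still be treated as a sentence end.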

vsloathe

Hmmm, another solution in the same vein: how many sentences in the English language would end with a one-letter word?

Perhaps a sentence like "He is taller than I.", but hardly anyone uses that anymore, preferring instead to forgo the understood verb "am" and use the incorrect "He is taller than me."

At any rate, "a" is never at the end of a sentence. Thus you could put in there

[a-z0-9 ]

to rule out any unwanted chars (including periods) in your final word...
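One possible reading of this idea, sketched here as an assumption rather than as the exact pattern meant above, is to require at least two letters or digits before the closing punctuation. The lookbehind below is a swapped-in technique: it checks the preceding characters without consuming them, so each sentence keeps its last word and its period.

```php
<?php
// Shortened sample text (assumption) containing the "U.S." acronym.
$paragraph = 'A beginning sentence. The U.S. Navy has killed a satellite from a ship. And ending sentence.';

// Split on whitespace only when it follows two alphanumerics plus . ! or ?
// The lookbehind is fixed-length (3 characters), as PCRE requires.
$sentences = preg_split('#(?<=[a-z0-9]{2}[.!?])\s+#i', $paragraph);

print_r($sentences);
// Three sentences; "U.S. " is skipped because ".S." does not have two
// alphanumerics before the final period.
```

The trade-off is the one described above: a sentence that really does end in a one-letter word, like "He is taller than I.", would not be split.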

nutballs

Cool, thanks. That made me realize what I was doing wrong with all my attempts.
This works perfectly so far.

$sentences = preg_split('#(.*?[^.][a-z0-9][.!?]) #is',$paragraph,-1,PREG_SPLIT_DELIM_CAPTURE);
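For anyone trying this, a small sketch of what that call returns on a shortened sample (the sample text and the $raw / $clean names are assumptions). Because each whole sentence is captured as the delimiter, the pieces between delimiters are empty strings, which PREG_SPLIT_NO_EMPTY can filter out:

```php
<?php
// Shortened sample text (assumption) containing the "U.S." acronym.
$paragraph = 'A beginning sentence. The U.S. Navy has killed a satellite from a ship. And ending sentence.';
$pattern   = '#(.*?[^.][a-z0-9][.!?]) #is';

// As posted: captured sentences are interleaved with empty strings.
$raw = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE);

// Same call with empty pieces dropped.
$clean = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($raw);
print_r($clean);
```

The final sentence only survives because it is the leftover text after the last match; it has no trailing space, so the pattern itself never matches it.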


vsloathe

That makes me feel a mite better about my regex aptitude.

Applause

Glad I could help.

nutballs

Nah, it was good. I was just doing some silly ()()()()() stuff that was mucking it all up. Your first example made me realize it.

dimitry12

Perl has the Lingua::EN::Sentence module, which I use.

