Thread: regex stumper

Normally I am a genius, but this one has me stumped (though it might be the painkillers I'm currently on).

I am trying to split sentences out from paragraphs, for obvious reasons if you know what I do with my spare time... But I am getting some bad splits. Acronyms are causing problems.

Take the following chunk of text:
A beginning sentence. This is the second satellite kill I have read, the first was with the U.S. Air Force launching a missile from a F-15 Eagle aircraft destroying another satellite reported to be no longer useful, and now the U.S. Navy has duplicated and killed a orbiting satellite from a ship. And ending sentence.

If I use the following regex to split on the periods, it splits on the periods you'd expect, plus the period at the end of "U.S." (There is a space after the ], by the way, to make sure I match only ending periods.)
'#[a-z0-9][.] #is'

So this will match any sentence that ends in a letter or number with a period after it and a space. But it also matches the period after "U.S. ". How do I prevent it from matching ".S. " but still match "SSSS. "?
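To see the bad split in isolation, here is a minimal sketch that runs the pattern above through preg_split. The shortened sample text and the $text and $parts names are made up for illustration; they are not from the original post.

```php
<?php
// Shortened, hypothetical version of the sample paragraph.
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

// The original pattern: a letter or digit, then a period, then a space.
$parts = preg_split('#[a-z0-9][.] #is', $text);

// The delimiter eats the last letter and the period of each piece,
// and the text is also cut after "U.S. ".
foreach ($parts as $i => $part) {
    echo $i, ': ', $part, PHP_EOL;
}
```

The second piece comes out as "The U.", which is the unwanted split on the acronym.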


Put a NOT condition for a period (a negated character class) before it:


'#[^.][a-z0-9][.] #is'

takes care of any of that stuff...maybe?
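A quick sketch of why the "maybe" is warranted, assuming the same kind of shortened sample text (the variable names and the "Mr. Smith" sentence are invented for illustration). The [^.] guard stops the split after "U.S. ", but an abbreviation like "Mr. " has no period in front of its last letter, so it still splits.

```php
<?php
// The suggested pattern: the character before the final letter must not be a period.
$pattern = '#[^.][a-z0-9][.] #is';

// Hypothetical shortened sample text.
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';
$parts = preg_split($pattern, $text);

// An abbreviation without an inner period is still treated as a sentence end.
$titled = preg_split($pattern, 'I met Mr. Smith today. He waved.');

echo count($parts), ' pieces for the sample text', PHP_EOL;
echo count($titled), ' pieces for the "Mr." text', PHP_EOL;
```

So it fixes dotted acronyms such as "U.S.", but not titles or other abbreviations that end in a single period.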


Hmmm, another solution in the same vein: how many sentences in the English language would end with a one-letter word?

Perhaps a sentence like "He is taller than I.", but hardly anyone uses that anymore, preferring instead to forgo the understood verb "am" and use the incorrect "He is taller than me."

At any rate, "a" is never at the end of a sentence. Thus, in place of the [^.], you could put in there

[a-z0-9 ]

to rule out any unwanted chars (including periods) in your final word...
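One way to read this suggestion, sketched under the assumption that the [a-z0-9 ] class sits directly in front of the final letter (the full pattern below and the sample strings are my own illustration, not something the poster wrote out):

```php
<?php
// Assumed reading: whitelist what may precede the final letter
// (a letter, a digit, or a space) instead of only forbidding a period.
$pattern = '#[a-z0-9 ][a-z0-9][.] #i';

// Hypothetical shortened sample text.
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';
$parts = preg_split($pattern, $text);

// Because a space is allowed, a one-letter final word such as "I" still ends a sentence.
$oneLetter = preg_split($pattern, 'He is taller than I. So what.');

print_r($parts);
print_r($oneLetter);
```

With this reading, "U.S. " is left alone because the character before the "S" is a period, which is not in the whitelist.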


Cool, thanks. That made me realize what I was doing wrong with all my attempts.
This works perfectly so far.

$sentences = preg_split('#(.*?[^.][a-z0-9][.!?]) #is',$paragraph,-1,PREG_SPLIT_DELIM_CAPTURE);
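A small sketch of what that call returns, using a shortened, made-up sample paragraph. Because the whole sentence sits inside the capturing group, preg_split leaves empty strings between the captured delimiters; adding PREG_SPLIT_NO_EMPTY (my addition, not part of the line above) drops them.

```php
<?php
// Hypothetical shortened sample paragraph.
$paragraph = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

$pattern = '#(.*?[^.][a-z0-9][.!?]) #is';

// As posted: captured sentences are interleaved with empty strings.
$raw = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE);

// With PREG_SPLIT_NO_EMPTY added, only the sentences remain.
$sentences = preg_split(
    $pattern,
    $paragraph,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($sentences);
```

The last sentence has no trailing space, so it never matches the pattern; it comes back as the leftover piece after the final delimiter.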


That makes me feel a mite better about my




Glad I could help.


Nah, it was good. I was just doing some silly ()()()()() stuff that was mucking it all up. Your first example made me realize it.



Perl has the Lingua::EN::Sentence module, which I use.
