
nutballs
Normally I am a regex genius, but this one has me stumped (though it might be the painkillers I'm currently on).

I am trying to split sentences out from paragraphs, for obvious reasons if you know what I do with my spare time... But I am getting some bad splits. Acronyms are causing problems. Take the following chunk of text:

A beginning sentence. This is the second satellite kill I have read, the first was with the U.S. Air Force launching a missile from a F-15 Eagle aircraft destroying another satellite reported to be no longer useful, and now the U.S. Navy has duplicated and killed a orbiting satellite from a ship. And ending sentence.

If I use the following regex to split on the periods, it splits on the periods you expect, plus the period at the end of the U.S. (there is a space after the ] btw, to make sure I match only ending periods):

'#[a-z0-9][.] #is'

So this will match any sentence that ends in a letter or number with a period after it and a space. But it also matches the period after "U.S. ". How do I prevent it from matching ".S. " but still match "SSSS. "?

vsloathe
Put a NOT condition for a period in front of it.
E.g. '#[^.][a-z0-9][.] #is' takes care of any of that stuff... maybe?

vsloathe
Hmmm, another solution along the same vein: how many sentences in the English language would end with a one-letter word?
Perhaps a sentence like "He is taller than I.", but hardly anyone uses that anymore, preferring instead to forgo the understood verb "am" and use the incorrect "He is taller than me." At any rate, "a" is never at the end of a sentence. Thus you could put [a-z0-9 ] in that spot instead, to rule out any unwanted chars (including periods) in your final word...

nutballs
Cool, thanks. That made me realize what I was doing wrong with all my attempts.
This works perfectly so far:

$sentences = preg_split('#(.*?[^.][a-z0-9][.!?]) #is', $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE);

vsloathe
That makes me feel a mite better about my regex aptitude. Glad I could help.

nutballs
Nah, it was good. I was just doing some silly ()()()()() stuff that was mucking it all up. Your first example made me realize it.
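To make the difference between the two patterns above concrete, here is a minimal sketch that lists what each one matches in the sample paragraph. It is written in Python rather than the PHP used in the thread (an assumption made purely for illustration), and the helper name `terminators` is hypothetical. The `s` flag is left out because neither pattern contains a dot metacharacter, so it would have no effect here.

```python
import re

# Sample paragraph from the first post, copied verbatim.
TEXT = (
    "A beginning sentence. This is the second satellite kill I have read, "
    "the first was with the U.S. Air Force launching a missile from a F-15 "
    "Eagle aircraft destroying another satellite reported to be no longer "
    "useful, and now the U.S. Navy has duplicated and killed a orbiting "
    "satellite from a ship. And ending sentence."
)

# The pattern from the question: letter or digit, period, space.
ORIGINAL = r"[a-z0-9][.] "

# The suggested fix: the character before the final letter must not be a period.
FIXED = r"[^.][a-z0-9][.] "


def terminators(pattern, text):
    """Return every substring the pattern matches (hypothetical helper)."""
    return re.findall(pattern, text, flags=re.IGNORECASE)


print(terminators(ORIGINAL, TEXT))  # ['e. ', 'S. ', 'S. ', 'p. ']
print(terminators(FIXED, TEXT))     # ['ce. ', 'ip. ']
```

The two 'S. ' hits from "U.S. " disappear with the fixed pattern, because the character before the "S" is a period and `[^.]` refuses it.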
dimitry12
Perl has the Lingua::EN::Sentence module, which I use.
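For anyone staying with the regex approach rather than a module, the preg_split call posted earlier in the thread can be sketched roughly as follows. This is a Python rendering, not the original PHP (an assumption for illustration), and the function name `split_sentences` is hypothetical. Splitting on a captured group also returns empty strings between consecutive matches, so the sketch filters those out.

```python
import re

# Same pattern as the preg_split call in the thread: lazily grab text up to a
# letter or digit that is not preceded by a period, then ., ! or ?, then a space.
SENTENCE = re.compile(r"(.*?[^.][a-z0-9][.!?]) ", re.IGNORECASE | re.DOTALL)

# Sample paragraph from the first post, copied verbatim.
TEXT = (
    "A beginning sentence. This is the second satellite kill I have read, "
    "the first was with the U.S. Air Force launching a missile from a F-15 "
    "Eagle aircraft destroying another satellite reported to be no longer "
    "useful, and now the U.S. Navy has duplicated and killed a orbiting "
    "satellite from a ship. And ending sentence."
)


def split_sentences(paragraph):
    """Split a paragraph into sentences (hypothetical helper)."""
    # The captured group keeps each sentence in the result; the pieces
    # between matches are empty strings and are dropped here.
    pieces = SENTENCE.split(paragraph)
    return [piece for piece in pieces if piece]


for sentence in split_sentences(TEXT):
    print(sentence)
```

On the sample paragraph this yields three sentences, with both "U.S." acronyms left intact inside the second one. The flip side of the `[^.]` rule is that a sentence which actually ends in an acronym such as "U.S." will not be split from the sentence that follows it.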
