The Cache: Technology Expert's Forum
 
Author Topic: regex stumper  (Read 3846 times)
nutballs
« on: February 21, 2008, 02:06:03 PM »

Normally I am a regex genius, but this one has me stumped (though it might be the painkillers I'm currently on).

I am trying to split sentences out from paragraphs, for obvious reasons if you know what I do with my spare time... But I am getting some bad splits. Acronyms are causing problems.

Take the following chunk of text:
A beginning sentence. This is the second satellite kill I have read, the first was with the U.S. Air Force launching a missile from a F-15 Eagle aircraft destroying another satellite reported to be no longer useful, and now the U.S. Navy has duplicated and killed a orbiting satellite from a ship. And ending sentence.

If I use the following regex to split on the periods, it splits on the periods you expect, plus the period at the end of "U.S." (There is a space after the last ], btw, to make sure I match only sentence-ending periods.)
'#[a-z0-9][\.] #is'

So this matches any sentence that ends in a letter or number followed by a period and a space. But it also matches the period after "U.S. ". How do I prevent it from matching ".S. " but still match "SSSS. "?
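
To make the bad split concrete, here is a minimal PHP sketch. The shortened sample text and the variable names are assumptions for illustration, not part of the original paragraph; the pattern is the one quoted above.

```php
<?php
// Shortened stand-in for the sample paragraph (an assumption, not the original text).
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

// The pattern from the post: one letter or digit, a period, then a space.
$pattern = '#[a-z0-9][\.] #is';

// Collect every spot the pattern matches.
preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // includes "S. " from "U.S. ", which is the unwanted hit

// preg_split() discards whatever matched, so each piece also loses
// its last letter and its period.
$pieces = preg_split($pattern, $text);
print_r($pieces);
```

Besides the extra split after "U.S.", the pieces come back without their final letter and period, because the split pattern consumes them.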

I could eat a bowl of Alphabet Soup and shit a better argument than that.
vsloathe
« Reply #1 on: February 21, 2008, 02:28:17 PM »

Put a NOT condition for a period in front of it, so the character before the final letter can't be a period.

E.g.

'#[^\.][a-z0-9][\.] #is'

takes care of any of that stuff...maybe?
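
A quick way to check that suggestion, as a hedged sketch: the sample text is a shortened stand-in and the variable names are made up for illustration. It also shows one case the negated class probably does not cover.

```php
<?php
// Shortened stand-in for the sample paragraph (an assumption).
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

// Same pattern with a leading [^\.]: the character before the
// final letter must not be a period.
$pattern = '#[^\.][a-z0-9][\.] #is';

$pieces = preg_split($pattern, $text);
print_r($pieces); // "U.S. " no longer triggers a split

// Abbreviations without an inner period still match, e.g. "Dr. ".
$stillSplits = preg_split($pattern, 'Ask Dr. Smith about it.');
print_r($stillSplits);
```

Note that the split now eats two letters plus the period from the end of each piece, since the negated class is part of the match.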
hai
vsloathe
« Reply #2 on: February 21, 2008, 02:31:17 PM »

Hmmm, another solution in the same vein - how many sentences in the English language would end with a one-letter word?

Perhaps a sentence like "He is taller than I.", but hardly anyone uses that anymore, preferring instead to forgo the understood verb "am" and use the incorrect "He is taller than me."

At any rate, "a" is never at the end of a sentence. Thus you could put in there

[a-z0-9\ ]

to rule out any unwanted chars (including periods) in your final word...
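
One possible reading of this, sketched in PHP: swap the negated class for the whitelist, so the character before the final letter must be a letter, digit, or space. Where the class goes in the pattern is an assumption on my part, as are the sample strings and variable names.

```php
<?php
// Shortened stand-in for the sample paragraph (an assumption).
$text = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

// Whitelist version: the character before the final letter must be
// a letter, a digit, or a space.
$whitelist = '#[a-z0-9\ ][a-z0-9][\.] #is';
// Negated version from the earlier reply, for comparison.
$negated = '#[^\.][a-z0-9][\.] #is';

$pieces = preg_split($whitelist, $text);
print_r($pieces); // same three pieces as the negated version on this text

// The whitelist is stricter: a hyphen before the last letter blocks the split.
$other = 'The U.S. Navy has plan-B. Next sentence.';
$strict = preg_split($whitelist, $other);
$loose = preg_split($negated, $other);
print_r($strict);
print_r($loose);
```

On the sample text both versions behave the same. The whitelist only differs when some other punctuation sits right before the last letter, which may or may not be what you want.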
nutballs
« Reply #3 on: February 21, 2008, 04:13:07 PM »

Cool, thanks. That made me realize what I was doing wrong with all my attempts.
This works perfectly so far.
Code:
$sentences = preg_split('#(.*?[^\.][a-z0-9][\.\!\?]) #is',$paragraph,-1,PREG_SPLIT_DELIM_CAPTURE);
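
For anyone trying this, a small sketch of what the call returns. The shortened paragraph is a stand-in, and the PREG_SPLIT_NO_EMPTY flag is an addition of mine, not part of the posted line.

```php
<?php
// Shortened stand-in for the sample paragraph (an assumption).
$paragraph = 'A beginning sentence. The U.S. Navy killed a satellite from a ship. An ending sentence.';

// The pattern from the post: the lazy group captures a whole sentence up to
// its closing . ! or ?, and the trailing space stays outside the group.
$pattern = '#(.*?[^\.][a-z0-9][\.\!\?]) #is';

// As posted: the captured sentences are returned, with empty strings
// where nothing sits between two matches.
$raw = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE);

// Adding PREG_SPLIT_NO_EMPTY (my addition) drops those empty strings.
$sentences = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($sentences);
```

Because each sentence is captured as the delimiter, it keeps its last letter and punctuation. The final sentence has no trailing space, so it comes back as the leftover piece rather than as a capture.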

vsloathe
« Reply #4 on: February 21, 2008, 04:27:38 PM »

That makes me feel a mite better about my regex aptitude.

 ROFLMAO

Glad I could help.
nutballs
« Reply #5 on: February 21, 2008, 05:04:37 PM »

Nah, it was good. I was just doing some overcomplicated ()()()()() stuff that was mucking it all up. Your first example made me realize it.
dimitry12
« Reply #6 on: February 22, 2008, 04:27:08 PM »

Perl has the Lingua::EN::Sentence module, which I use.