Topic: regex stumper (Read 1735 times)

nutballs (Administrator)
« on: February 21, 2008, 02:06:03 PM »
Normally I am a regex genius, but this one has me stumped (though it might be the painkillers I'm currently on).
I am trying to split sentences out from paragraphs, for obvious reasons if you know what I do with my spare time... But I am getting some bad splits. Acronyms are causing problems.
Take the following chunk of text:
A beginning sentence. This is the second satellite kill I have read, the first was with the U.S. Air Force launching a missile from a F-15 Eagle aircraft destroying another satellite reported to be no longer useful, and now the U.S. Navy has duplicated and killed a orbiting satellite from a ship. And ending sentence.
If I use the following regex to split on the periods, it splits on the periods you expect, plus the period at the end of "U.S." (There is a space after the ] btw, to make sure I match only ending periods.)
'#[a-z0-9][\.] #is'
So this will match any sentence that ends in a letter or number followed by a period and a space. But it also matches the period after "U.S. ". How do I prevent it from matching ".S. " but still match "SSSS. "?
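
To make the bad split concrete, here is a minimal sketch that lists every place the pattern above matches in the sample paragraph. The variable names are illustrative only and are not from the original post.

```php
<?php
// Sample paragraph from the post above.
$paragraph = 'A beginning sentence. This is the second satellite kill I have read, '
    . 'the first was with the U.S. Air Force launching a missile from a F-15 Eagle '
    . 'aircraft destroying another satellite reported to be no longer useful, and now '
    . 'the U.S. Navy has duplicated and killed a orbiting satellite from a ship. '
    . 'And ending sentence.';

// Collect every substring the original pattern matches:
// one letter or digit, a period, then a space.
preg_match_all('#[a-z0-9][\.] #is', $paragraph, $matches);

print_r($matches[0]);
// Matches: "e. ", "S. ", "S. ", "p. "
// The two "S. " hits are the unwanted splits inside "U.S. ".
```

Note that preg_split() discards whatever the pattern matches, so splitting on this pattern would also drop the last letter and the period of every sentence.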

vsloathe (Global Moderator)
« Reply #1 on: February 21, 2008, 02:28:17 PM »
Put a NOT condition in front of it, so the character before the final letter cannot be a period.
E.g.
'#[^\.][a-z0-9][\.] #is'
That takes care of any of that stuff... maybe?
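
A quick way to check this suggestion is to run the pattern against the two cases from the question. This is only a sketch; the test strings are made up for illustration.

```php
<?php
// Pattern from the reply above: the character before the final
// letter or digit must not be a period.
$pattern = '#[^\.][a-z0-9][\.] #is';

// ".S. " inside an acronym: the "S" is preceded by a period, so no match.
$acronym = preg_match($pattern, 'the U.S. Air Force');

// An ordinary sentence end: "p" is preceded by "i", so it matches.
$sentenceEnd = preg_match($pattern, 'from a ship. And ending');

var_dump($acronym);     // int(0)
var_dump($sentenceEnd); // int(1)
```

One caveat: abbreviations without inner periods still match. For example, "Dr. Smith" matches at "Dr. ", so the pattern would still split there.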

vsloathe (Global Moderator)
« Reply #2 on: February 21, 2008, 02:31:17 PM »
Hmmm, another solution in the same vein: how many sentences in the English language would end with a one-letter word?
Perhaps a sentence like "He is taller than I.", but hardly anyone uses that anymore, preferring instead to forgo the understood verb "am" and use the incorrect "He is taller than me."
At any rate, "a" is never at the end of a sentence. Thus you could put in there
[a-z0-9\ ]
to rule out any unwanted chars (including periods) in your final word...

nutballs (Administrator)
« Reply #3 on: February 21, 2008, 04:13:07 PM »
Cool, thanks. That made me realize what I was doing wrong with all my attempts.
This works perfectly so far.
Code:
$sentences = preg_split('#(.*?[^\.][a-z0-9][\.\!\?]) #is',$paragraph,-1,PREG_SPLIT_DELIM_CAPTURE);
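
For reference, a sketch of what that call produces on the sample paragraph from the first post. One thing here is an addition, not part of the posted code: the PREG_SPLIT_NO_EMPTY flag. Because the whole sentence is captured as the delimiter, preg_split() otherwise returns an empty string before each captured sentence.

```php
<?php
// Sample paragraph from the first post.
$paragraph = 'A beginning sentence. This is the second satellite kill I have read, '
    . 'the first was with the U.S. Air Force launching a missile from a F-15 Eagle '
    . 'aircraft destroying another satellite reported to be no longer useful, and now '
    . 'the U.S. Navy has duplicated and killed a orbiting satellite from a ship. '
    . 'And ending sentence.';

// Same pattern as posted. PREG_SPLIT_NO_EMPTY is an assumption added here
// to drop the empty pieces that sit between the captured sentences.
$sentences = preg_split(
    '#(.*?[^\.][a-z0-9][\.\!\?]) #is',
    $paragraph,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($sentences);
// [0] => A beginning sentence.
// [1] => This is the second satellite kill ... from a ship.
// [2] => And ending sentence.
```

The last sentence has no trailing space, so it never matches the pattern and comes back as the leftover piece after the final split.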

vsloathe (Global Moderator)
« Reply #4 on: February 21, 2008, 04:27:38 PM »
That makes me feel a mite better about my regex aptitude.
Glad I could help.

nutballs (Administrator)
« Reply #5 on: February 21, 2008, 05:04:37 PM »
Nah, it was good. I was just doing some dumb ()()()()() stuff that was mucking it all up. Your first example made me realize it.

dimitry12 (Rookie)
« Reply #6 on: February 22, 2008, 04:27:08 PM »
Perl has the Lingua::EN::Sentence module, which I use.