Stack Overflow Asked on December 3, 2021
The following regex is supposed to act as a sentence tokenizer pattern, but I’m having some trouble deciphering what exactly it’s doing:
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|!)\s
I understand that it’s using positive and negative lookbehinds, as the accepted answer of this post explains (they give the example of a negative lookbehind like this: (?<!B)A). But what is considered A in the above regex?
The regex is checking for breaks between sentences. The negative lookbehinds prevent false matches that represent abbreviations instead of the ends of sentences. They mean:
(?<!\w\.\w.)
Don't match anything that looks like A.b., 2.c., or 1.3. (Probably they meant for the second period to also be \. to match only a period, but as written it will match any character at the end, for example A.b! or g.Z4)
(?<![A-Z][a-z]\.)
Don't match anything that looks like Cf., Dr., Mr., etc. Note this only checks the two letters before the period, so the whitespace after "Mrs." will be matched incorrectly.
(?<![A-Z]\.)
Don't match anything that looks like A. or C.
Then if these all pass, it has a positive lookbehind (?<=\.|\?|!) to check for ., ? or !.
And finally it matches on any whitespace \s. That whitespace is the A from the (?<!B)A example: it is the only part of the pattern that is actually consumed.
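To see the lookbehinds in action, here is a minimal Python sketch that splits a sample string on the pattern. It assumes the backslashes are written as in the explanation above (\w, \., \?, \s). The names PATTERN, STRICT_PATTERN and split_sentences are made up for this illustration, and STRICT_PATTERN is only a guess at what the original author may have intended for the first lookbehind.

```python
import re

# Pattern from the question, with the escapes written out.
PATTERN = re.compile(
    r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|!)\s"
)

# Hypothetical stricter variant: the last character of the first lookbehind
# is escaped, so it only rejects a literal period (e.g. "p.m.").
STRICT_PATTERN = re.compile(
    r"(?<!\w\.\w\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|!)\s"
)


def split_sentences(text, pattern=PATTERN):
    """Split text at every whitespace character that the pattern matches."""
    return pattern.split(text)


sample = "Dr. Smith arrived. He met Mrs. Jones at 5 p.m. yesterday! Was it late? No."
for sentence in split_sentences(sample):
    print(repr(sentence))

# The unescaped final "." in the first lookbehind also blocks a split after "A.b!".
quirk = "See A.b! Next one."
print(split_sentences(quirk))
print(split_sentences(quirk, STRICT_PATTERN))
```

With the sample string, the whitespace after "Dr." and "p.m." is skipped, while the whitespace after "Mrs." is still treated as a sentence break, which is the limitation mentioned above. The last two calls show the difference the escaped period makes for "A.b!".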
Answered by jdaz on December 3, 2021