TransWikia.com

Using Regular Expression

Mathematica Asked on October 22, 2021

sample2 = "this is a test to find whether pi works for other words or not as well pi pi pi potatpi pineapple pi-neapple";

Here sample2 is a string, and I want to match only the standalone word "pi", not the "pi" inside "pi-neapple" or "potatpi".

The code I tried:

StringCases[sample2, RegularExpression["\\b(pi)\\b"]]

output :

{"pi", "pi", "pi", "pi", "pi"}

Could you please help me?

2 Answers

The extra "pi" in the output appears because - is not a word character, so the pi in pi-neapple matches \b(pi)\b.

StringMatchQ["-", RegularExpression["\\w"]]
(*False*)
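
The effect on the word boundary can be checked directly. A minimal sketch, using two made-up one-word inputs; note that inside a Wolfram Language string each regex backslash has to be written as \\:

```mathematica
(* "-" is not a word character, so \b sees a boundary between "pi" and "-" *)
StringCases["pi-neapple", RegularExpression["\\bpi\\b"]]
(* {"pi"} *)

(* "n" is a word character, so there is no boundary after "pi" here *)
StringCases["pineapple", RegularExpression["\\bpi\\b"]]
(* {} *)
```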

You can use the following pattern to treat - like a word character:

(?<![\w-])(pi)(?![\w-])

which leads to one fewer pi in the result:

StringCases[
  sample2,
  RegularExpression["(?<![\\w\\-])(pi)(?![\\w\\-])"]
]
(*{"pi", "pi", "pi", "pi"}*)

To ensure that these are the right pis, we can use the following test case:

StringCases[
  "pi1 foo-pi2 pi3-foo foo-pi4-bar api5 pi6peline pi7 pi8", 
   RegularExpression["(?<![\\w\\-])(pi\\d)(?![\\w\\-])"]
]
(*{"pi1", "pi7", "pi8"}*)

About the pattern

The pattern \b(pi)\b means

pi which is not preceded by a word character (\w) and is not followed by a word character.

All we need to do here is replace "by a word character" with "by a word character or a dash".

For this we can use lookarounds, which most regular-expression references explain in detail. In a nutshell, (?<!foo)bar means bar not preceded by something matching foo, and foo(?!bar) means foo not followed by something matching bar.
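
The two negative lookarounds can be tried on short strings to get a feel for them. This is only an illustrative sketch; the input strings are made up:

```mathematica
(* negative lookbehind: "bar" not preceded by "foo" *)
StringCases["foobar bazbar", RegularExpression["(?<!foo)bar"]]
(* {"bar"}, the one in "bazbar" *)

(* negative lookahead: "foo" not followed by "bar" *)
StringCases["foobar foobaz", RegularExpression["foo(?!bar)"]]
(* {"foo"}, the one in "foobaz" *)
```

Lookarounds only test the neighbouring text and do not consume it, which is why the match itself stays exactly "bar" or "foo".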

Answered by Anton.Sakovich on October 22, 2021

As a start, try using \s, which stands for any whitespace character.

StringCases[
 sample2,
 RegularExpression["\\s+(pi)\\s+"] -> "$1",
 Overlaps -> True
 ]

{"pi", "pi", "pi", "pi"}

Read towards the end of this answer for more information on how to make this more robust.
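
The Overlaps -> True option matters because this pattern consumes the whitespace on both sides of pi. With the default Overlaps -> False, the space after one match has already been used up and cannot serve as the leading whitespace of the next one. A small sketch for comparing the two settings, assuming a made-up input called test with adjacent occurrences:

```mathematica
test = " pi pi pi ";  (* made-up input with adjacent occurrences *)

(* default setting: matches may not share any characters *)
StringCases[test, RegularExpression["\\s+(pi)\\s+"] -> "$1"]

(* overlapping matches allowed, at most one per starting position *)
StringCases[test, RegularExpression["\\s+(pi)\\s+"] -> "$1", Overlaps -> True]
```

Comparing the lengths of the two results shows how many occurrences the default setting skips.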

The corresponding Wolfram Language string pattern is this:

StringCases[
 sample2,
 Whitespace ~~ s:"pi" ~~ Whitespace -> s,
 Overlaps -> True
 ]

{"pi", "pi", "pi", "pi"}

It is at least functionally equivalent in this case, but it does not use the exact same regular expression. We can see what regular expression it translates the string pattern into like this:

StringPattern`PatternConvert["[\s\n]+(pi)[\s\n]+"] // First

"(?ms)\[\\s\\n\]\+\(pi\)\[\\s\\n\]\+"

(Mathematica threw in a couple of extra backslashes for good measure upon copying the pattern.)

Robustification

user1066 has identified issues with the regex solution. First, it doesn't work if the string starts or ends with a pi. Second, it doesn't work if there are more than two spaces.

One possible way to patch the solution to work for these cases is:

StringCases[
 StringReplace[s, " " .. -> " "], {
  RegularExpression["\\s+(pi)\\s+"] -> "$1",
  RegularExpression["^(pi)\\s+"] -> "$1",
  RegularExpression["\\s+(pi)$"] -> "$1"
  },
 Overlaps -> True
 ]

user1066 found the following solution which neatly packs these patterns into one regex:

StringCases[
 s,
 RegularExpression["(?i)(^|\\s)(pi)($|\\s)"] -> "$2",
 Overlaps -> True
 ]
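
As a possible alternative that is not taken from either answer, the whitespace test can be written with negative lookarounds on \S (any non-whitespace character). Because lookarounds consume nothing, this sketch should need neither Overlaps -> True nor special cases for the ends of the string. Unlike the pattern above, it is case-sensitive unless (?i) is added.

```mathematica
sample2 = "this is a test to find whether pi works for other words or not as well pi pi pi potatpi pineapple pi-neapple";

(* "pi" neither preceded nor followed by a non-whitespace character *)
StringCases[sample2, RegularExpression["(?<!\\S)pi(?!\\S)"]]
(* {"pi", "pi", "pi", "pi"} *)

(* start of string, end of string and runs of spaces are covered as well *)
StringCases["pi  pi", RegularExpression["(?<!\\S)pi(?!\\S)"]]
(* {"pi", "pi"} *)
```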

Answered by C. E. on October 22, 2021
