Data Science Asked by Gilad Deutsch on July 17, 2021
I saw that there are a lot of string similarity algorithms to whether two strings are the same. I have a slightly different problem – I get two strings "a" and "b" and I need a similarity algorithm to whether "b" contains "a" ("b" is likely to contain "a" with some "error"). I haven’t found any algorithms to solve this problem and I’d be happy to hear about some (if they exist).
My first thought was that it's difficult to formally define this concept:
"b" is likely to contain "a" with some "error"
a
is a substring of b
. This question should have a boolean answer: either it does contain it, or it doesn't.b
should contain a substring c
which is "similar enough" to a
. In general the question of similarity between two strings is answered with a numerical value, usually a real number between 0 and 1.As far as I know the only way to solve this issue is to consider that there is a threshold on the similarity score between a
and c
, where c
is any substring of b
. This way the answer becomes boolean.
However there might be a way around this by considering the substring operation as part of the similarity calculation. In particular the Levenshtein edit distance can account for insertions/deletions of characters, which is what a substring is w.r.t the string which contains it.
More interestingly, it is possible to assign a different cost to any particular edit operation in the Levenshtein distance. So it's probably possible to define a variant of Levenshtein where insertions at the beginning or at the end would have cost 0, thus making the final distance $x$ between a
and b
equivalent to "b
contains a substring c
which has distance $x$ against a
".
The way I would try to implement this is:
Note that there might be flaws in my idea, I didn't try it :D
Answered by Erwan on July 17, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP