TransWikia.com

Which string distance equation for fuzzy-matching person names is reliable?

Data Science Asked by Canovice on December 23, 2020

A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzyjoin library in R offers the following string distance methods: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from two different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names:

  • The left join shouldn’t get mixed up by long / short names. Michael Gadson is clearly Mike Gadson, not one of the other Mike names in the dataset with a different last name.
  • The left join shouldn’t get mixed up by reversed names. Ricky Smith is Rick Smith, he is not Smith Rickie.
  • The left join shouldn’t get mixed up by name suffixes such as III or Jr., or by extra spaces or symbols (e.g. De Andre' vs DeAndre).
  • Certain players (e.g. Johnny Williams) in the left-hand-side dataframe have no match in the right-side table. To catch this, we’ll need to rely on a properly selected max_dist value.
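The suffix, spacing and punctuation concern can often be reduced before any distance is computed, by normalizing both name columns first. Below is a minimal base-R sketch of that idea. Note that normalize_name is a hypothetical helper (it is not from the linked post), the suffix list is an assumption that would need extending for real data, and the example names are made up for illustration.

```r
# Hypothetical helper: normalize person names before fuzzy matching.
# Assumption: only the suffixes jr, sr, ii, iii and iv need to be dropped.
normalize_name <- function(x) {
  x <- tolower(x)                          # make matching case-insensitive
  x <- gsub("[[:punct:]]", "", x)          # drop periods, apostrophes, commas
  x <- gsub("\\s+", " ", trimws(x))        # trim and collapse repeated spaces
  x <- sub(" (jr|sr|ii|iii|iv)$", "", x)   # drop one trailing suffix token
  x
}

normalize_name(c("Rick Smith  III", "Mike Gadson, Jr.", "De Andre' Smith"))
# "rick smith" "mike gadson" "de andre smith"
```

This does not make "de andre" and "deandre" identical; it only shrinks the gap to a single character, which the chosen distance method and max_dist still have to absorb.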

A fifth concern is avoiding duplicates in the joined result (we want only one row for each person in the left-hand-side dataframe); however, this is already handled by the group_by(fullName) %>% filter(dist == min(dist) | is.na(dist)) step in the code.
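For context, here is a rough sketch of how that de-duplication step might sit in the full pipeline, assuming the fuzzyjoin and dplyr packages. The data frames, the right-hand column name `name`, the choice of "jw" and the 0.2 threshold are all placeholders made up for illustration, not values from the linked post and not a recommendation.

```r
library(dplyr)
library(fuzzyjoin)

# Made-up illustrative data (assumption: not the data from the linked post).
left  <- data.frame(fullName = c("Michael Gadson", "Ricky Smith", "Johnny Williams"),
                    stringsAsFactors = FALSE)
right <- data.frame(name = c("Mike Gadson", "Mike Jones", "Rick Smith", "Smith Rickie"),
                    stringsAsFactors = FALSE)

result <- left %>%
  stringdist_left_join(right,
                       by = c("fullName" = "name"),
                       method = "jw",           # assumption: any listed method could go here
                       max_dist = 0.2,          # assumption: placeholder threshold
                       distance_col = "dist") %>%
  group_by(fullName) %>%
  # keep the closest candidate per left-hand name, plus unmatched rows (dist is NA)
  filter(dist == min(dist) | is.na(dist)) %>%
  ungroup()

result
```

One caveat: if two candidates tie for the minimum distance, this filter keeps both, so a tie-breaker such as slice(1) would still be needed to guarantee exactly one row per person.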

Our question is then: given these concerns, what is a good method and max distance to use for this left join?
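One way to explore this empirically is to compute the distance for a few known "should match" and "should not match" pairs under every method and look at how well each one separates them. The sketch below assumes the stringdist package, which implements distance methods with these names. compare_methods is a hypothetical helper, and the pairs are taken from the concerns listed above.

```r
library(stringdist)

methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
             "cosine", "jaccard", "jw", "soundex")

# Hypothetical helper: distance between two names under every method.
compare_methods <- function(a, b) {
  sapply(methods, function(m) stringdist(a, b, method = m))
}

# A pair that should match, and a pair that should not.
round(compare_methods("Michael Gadson", "Mike Gadson"), 3)
round(compare_methods("Ricky Smith", "Smith Rickie"), 3)
```

The methods are not on a common scale: the edit-based ones and qgram return counts, cosine, jaccard and jw return values between 0 and 1, soundex returns 0 or 1, and hamming returns Inf for strings of unequal length. Any max_dist value therefore only makes sense relative to the method it is paired with.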
