Data Science Asked by Canovice on December 23, 2020
A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to re-type the code). The fuzzyjoin library in R offers the following string-distance methods: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the Stack Overflow post, we have the following concerns to account for when string matching names:
1. Michael Gadson is clearly Mike Gadson, not one of the other Mike names in the dataset with a different last name.
2. Ricky Smith is Rick Smith; he is not Smith Rickie.
3. Names may differ by a suffix (III, Jr., etc.), or by extra spaces or symbols (e.g. De Andre' vs DeAndre).
4. Some names (e.g. Johnny Williams) in the left-hand-side dataframe have no match in the right-side table. To catch this, we'll need to rely on a properly selected max_dist value.

A 5th concern is avoiding duplicates (we want only 1 row for each person in the left-hand-side dataframe); however, this is handled with group_by(fullName) %>% filter(dist == min(dist) | is.na(dist)) in the code.
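As a rough sketch of how the suffix, spacing and symbol differences could be reduced before any distance is computed, the names could be normalised first. The helper normalize_name below is hypothetical (it is not from the linked post), the suffix list is an assumption that would need extending for real data, and the example last name is made up. It uses base R only.

```r
# Hypothetical helper, not part of the original post: normalise a player
# name so that suffixes, punctuation and stray spaces do not inflate the
# string distance.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("['.]", "", x)                      # drop apostrophes and periods
  x <- gsub("[[:punct:]]", " ", x)              # other punctuation becomes a space
  x <- trimws(gsub("\\s+", " ", x, perl = TRUE)) # collapse repeated spaces
  # Assumed suffix list; only a trailing suffix is removed.
  sub("\\s+(jr|sr|ii|iii|iv)$", "", x, perl = TRUE)
}

# Made-up inputs in the spirit of the examples above.
cleaned <- normalize_name(c("De Andre' Smith Jr.", "DeAndre Smith", "  Rick   Smith III "))
print(cleaned)
```

After this step, a remaining difference such as the space in "de andre" vs "deandre" is a single edit, which leaves more of the max_dist budget for real spelling differences such as nicknames.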
Our question is then: given these concerns, what is a good method and max distance to use for this left join?
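To make the question concrete, here is a minimal sketch of the join being discussed, assuming the fuzzyjoin and dplyr packages are installed. The toy data, the right-hand column name playerName, the wrapper match_names, and the default values method = "jw" and max_dist = 0.25 are all assumptions for illustration, not recommendations. Note that "jw" distances lie between 0 and 1, whereas edit-based methods such as "lv" or "osa" count edits, so max_dist has to be chosen per method.

```r
library(dplyr)
library(fuzzyjoin)

# Made-up toy data; playerName is a hypothetical column name.
left_df <- data.frame(
  fullName = c("Mike Gadson", "Rick Smith", "Johnny Williams"),
  stringsAsFactors = FALSE
)
right_df <- data.frame(
  playerName = c("Michael Gadson", "Mike Jones", "Rick Smith"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: fuzzy left join, then keep the closest candidate
# (or the unmatched NA row) for each left-hand name.
match_names <- function(left, right, method = "jw", max_dist = 0.25) {
  stringdist_left_join(
    left, right,
    by = c("fullName" = "playerName"),
    method = method,
    max_dist = max_dist,
    distance_col = "dist"
  ) %>%
    group_by(fullName) %>%
    filter(dist == min(dist) | is.na(dist)) %>%
    ungroup()
}

matched <- match_names(left_df, right_df)
print(matched)
```

Running the wrapper over several method and max_dist combinations and inspecting which left-hand names end up wrongly matched or wrongly unmatched is one way to compare candidates. Ties on the minimum distance would still yield more than one row per name, so the duplicate filter is not a complete guarantee.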