Data Science Asked on August 14, 2021
I have a dataset of entries and a variable for the owner of each entry. Some of these people occur more than once, but their names are sometimes written differently. I want to eventually be able to aggregate the other data by owner. These are the names of business owners, so sometimes it’s a single name, sometimes it’s more than one name, and sometimes it’s just the company name. Here’s an example of some of the styles of names in the data:
I’ve never done anything like this before. How could I go about identifying some of the same people? Is there a way to create an index to identify the similarity between these groups? Most of the ones I’ve seen are for longer text. Is there an index well suited for this?
I apologize if this is too basic a question. I’m new to doing things like this and I’m not sure if I know exactly what to search for. I’m most comfortable with Stata and R but I’ve used Python before and I could eventually figure out how to do something with that.
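For R: Have a look at the stringr package. I would use, for example, the str_detect() function as follows: str_detect(column_of_different_names, "DOE|company_name"). This returns TRUE for each string that contains "DOE" or the company name you substitute for "company_name". The | in the pattern is the regular-expression "or", and matching is case-sensitive by default, so "Doe" would not match "DOE" unless you normalize the case first.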
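Since the question mentions Python as an option, here is a minimal sketch of the same pattern-matching idea using the standard re module. The owner names and the "ACME HOLDINGS" company name are hypothetical, because the example data was not included in the question, and the case-insensitive flag is an added assumption rather than part of the R answer.

```python
import re

# Hypothetical owner names; the original example data is not shown in the post.
owner_names = [
    "JOHN DOE",
    "Doe, John and Jane",
    "ACME HOLDINGS LLC",
    "Mary Smith",
]

# Same alternation idea as the stringr pattern: match "DOE" or the company name.
# IGNORECASE is an assumption added here so that "Doe" and "DOE" both match.
pattern = re.compile(r"DOE|ACME HOLDINGS", flags=re.IGNORECASE)


def detect(names, regex):
    """Return one boolean per name: True when the regex matches anywhere in it."""
    return [bool(regex.search(name)) for name in names]


matches = detect(owner_names, pattern)
print(matches)
```

Note that a plain substring pattern like this also matches longer words that contain it (for example, a name such as "Doerr"), so the flagged rows are candidates to review rather than confirmed duplicates.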
Correct answer by arne on August 14, 2021