How to deal with name strings in large data sets for ML?

Data Science Asked by Danny Abstemio on September 29, 2021

My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later.

Word embedding techniques are mostly used for longer text sequences rather than for single-word strings as in this case, so I don't think they would work well here. Label encoding and label binarization also seem unsuitable for names: label binarization produces a very large number of columns because there are so many distinct values, and label encoding assigns arbitrary integers that allow no meaningful comparison between names.

Are there other approaches for using or transforming name information so that it can be fed to ML algorithms?

One Answer

Your problem is essentially that your features have high cardinality, right? The best approach depends on your problem, but you can look into mean encoding (also known as target encoding). Essentially, you replace each name with the mean of the target variable over the rows that share that name. However, this is highly prone to overfitting, so you should take care.
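As a rough illustration of the idea, here is a minimal sketch of mean encoding in plain Python. It assumes a labeled numeric target is available. The function names, the toy data, and the `smoothing` parameter are hypothetical and not part of any particular library. Smoothing pulls the encoding of rare names toward the global mean, which is one common way to reduce the overfitting mentioned above. Fitting the encoding on training rows only, and then applying it to unseen rows, is another.

```python
from collections import defaultdict


def fit_mean_encoding(names, targets, smoothing=10.0):
    """Map each name to a smoothed mean of the target.

    `smoothing` is an assumed hyperparameter: the larger it is, the more
    a rarely seen name is pulled toward the global target mean.
    """
    if len(names) != len(targets):
        raise ValueError("names and targets must have the same length")
    global_mean = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, y in zip(names, targets):
        sums[name] += y
        counts[name] += 1

    # Weighted average of the per-name mean and the global mean.
    encoding = {
        name: (sums[name] + smoothing * global_mean) / (counts[name] + smoothing)
        for name in counts
    }
    return encoding, global_mean


def apply_mean_encoding(names, encoding, global_mean):
    """Encode names; names never seen during fitting get the global mean."""
    return [encoding.get(name, global_mean) for name in names]


# Hypothetical toy data: fit on training rows only to limit target leakage.
train_names = ["Anna", "Anna", "Bob", "Bob", "Bob", "Cleo"]
train_target = [1, 0, 1, 1, 1, 0]

encoding, global_mean = fit_mean_encoding(train_names, train_target, smoothing=2.0)
encoded_test = apply_mean_encoding(["Anna", "Dana"], encoding, global_mean)
```

Here "Dana" does not appear in the training data, so it falls back to the global mean instead of raising an error. In practice the encoding is often computed out-of-fold (each row encoded from the other folds) so that a row's own target does not leak into its feature.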

The following two videos will give an excellent explanation:

However, depending on your application, I would also consider removing sensitive information such as names. Always think about whether the features make sense.

I hope this helps. If you have any questions, leave a comment.

Answered by Victor Oliveira on September 29, 2021
