How to deal with name strings in large data sets for ML?

Question

My data set contains multiple columns with first name, last name, etc. I want to use a model such as Isolation Forest (an anomaly detection method) later.

Word embedding techniques are mostly used for longer text sequences, not for single-word strings as in this case, so I don't think they would work well here. Label encoding and label binarization also seem unsuitable for names: label binarization produces one column per distinct value, which is too many for names, and label encoding assigns arbitrary integers that allow no meaningful comparison between names.

Are there other approaches for transforming name information so that it can be used with ML algorithms?

Victor Oliveira · Answer

Your problem is essentially that your features have high cardinality, right? The best option depends on your problem, but you can look into mean encoding (also called target encoding). Essentially, you replace each name with the mean of the target variable for that name. However, this is highly prone to overfitting, so you should take care.
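To make the idea concrete, here is a minimal sketch of mean encoding with additive smoothing, using pandas. It assumes a labeled target column exists; the column names `first_name` and `label`, the helper `smoothed_mean_encode`, and the smoothing strength `alpha` are all made up for illustration, and the toy data is invented.

```python
import pandas as pd

def smoothed_mean_encode(df, column, target, alpha=10.0):
    """Replace each category by a smoothed mean of the target (hypothetical helper)."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    # Shrink the per-category mean toward the global mean; categories with
    # few rows are pulled more strongly, which limits overfitting on rare names.
    smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (
        stats["count"] + alpha
    )
    return df[column].map(smoothed)

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "first_name": ["Anna", "Anna", "Ben", "Ben", "Ben", "Cleo"],
    "label": [1, 0, 1, 1, 1, 0],
})
df["first_name_enc"] = smoothed_mean_encode(df, "first_name", "label", alpha=2.0)
print(df)
```

A larger `alpha` pulls every name closer to the global mean, so a name seen only once (like "Cleo" here) does not simply memorize its own label.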

The following two videos will give an excellent explanation:

https://www.coursera.org/learn/competitive-data-science/lecture/b5Gxv/concept-of-mean-encoding
https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization
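One common way to regularize mean encoding is to compute it out of fold, so that no row is encoded using its own target value. The sketch below is only an illustration under assumptions: the helper `out_of_fold_mean_encode`, the column names, the fold count, and the toy data are hypothetical, and names not seen in the other folds fall back to the mean of those folds.

```python
import numpy as np
import pandas as pd

def out_of_fold_mean_encode(df, column, target, n_splits=3, seed=0):
    """Encode each row with target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits folds at random.
    fold_ids = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        train = df.loc[~held_out]
        means = train.groupby(column)[target].mean()
        # Names absent from the training folds get the training-fold mean.
        values = df.loc[held_out, column].map(means).fillna(train[target].mean())
        encoded.loc[held_out] = values.to_numpy()
    return encoded

# Toy data, invented for illustration only.
df_oof = pd.DataFrame({
    "last_name": ["Silva", "Silva", "Silva", "Kim", "Kim", "Novak"],
    "label": [1, 0, 1, 0, 0, 1],
})
df_oof["last_name_enc"] = out_of_fold_mean_encode(df_oof, "last_name", "label")
print(df_oof)
```

When encoding unseen test data, the means would typically be computed once on the full training set instead.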

However, depending on your application, I would also consider removing sensitive information such as names. Always think about whether the features make sense.

I hope this helps. If you have any questions, leave a comment.
