Encoding Tags for Random Forest

Question

I have the following data set:

I want to use the attributes Tags and Authors to classify each record into its respective Rating, using a random forest classifier. My concern is how to deal with the Tags attribute. Each entry has a variable number of tags separated by commas. There are 4412 unique tags in total, and the entry with the most tags contains 20. The first entry has the tags ["Rhode Island", "Economy", "Taxes", "Lincoln Chafee"].
How should I encode this attribute so that I can use the Random Forest Classifier from sklearn?

10xAI · Accepted Answer

You can use sklearn's MultiLabelBinarizer. It turns each list of tags into a binary indicator row with one column per unique tag:

from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()

lb.fit_transform([['A', 'B', 'C'],[ 'A', 'D', 'E', 'B']])

array([[1, 1, 1, 0, 0],
       [1, 1, 0, 1, 1]])
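For the data in the question, the comma-separated Tags strings first have to be split into lists. The following is a minimal sketch, assuming the records sit in a pandas DataFrame with columns named Tags and Rating. Only the first row's tags come from the question; the other rows and all Rating values are made up for illustration, and the Authors column is left out for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: only the first row's tags come from the question;
# the remaining rows and every Rating value are invented.
df = pd.DataFrame({
    "Tags": [
        "Rhode Island,Economy,Taxes,Lincoln Chafee",
        "Economy,Jobs",
        "Taxes,Health Care",
        "Economy,Taxes",
    ],
    "Rating": ["true", "false", "false", "true"],
})

# Split each comma-separated string into a list of stripped tag names.
tag_lists = df["Tags"].apply(lambda s: [t.strip() for t in s.split(",")])

# One binary column per unique tag. Sparse output keeps memory use low
# when there are thousands of distinct tags.
mlb = MultiLabelBinarizer(sparse_output=True)
X_tags = mlb.fit_transform(tag_lists)

# Random forests in sklearn accept sparse input directly.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tags, df["Rating"])

print(X_tags.shape)   # (number of records, number of unique tags)
print(mlb.classes_)   # tag name for each column of X_tags
```

With 4412 unique tags the encoded matrix is mostly zeros, which is why the sketch asks for sparse output. Keep the fitted `mlb` around so that new records can be encoded with `mlb.transform` using the same column order.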

If required, remove the columns whose sum falls below a threshold value. This reduces the feature count by dropping tags that appear in very few records, which are the low-variance features.
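The column-sum filter described above could look roughly like this. The tag lists and the `min_count` threshold are hypothetical; a suitable threshold depends on the real data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, for illustration only.
tag_lists = [["A", "B", "C"], ["A", "D", "E", "B"], ["A", "C"]]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(tag_lists)

# The sum of each column is the number of records carrying that tag.
tag_counts = X.sum(axis=0)

min_count = 2  # assumed threshold; tune it for the real data
keep = tag_counts >= min_count

# Keep only the columns (and tag names) that meet the threshold.
X_reduced = X[:, keep]
kept_tags = mlb.classes_[keep]

print(kept_tags)
print(X_reduced.shape)
```

sklearn also provides `VarianceThreshold` in `sklearn.feature_selection`, which filters on variance rather than on raw counts and may be a more convenient fit inside a pipeline.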
