Encoding Tags for Random Forest

Data Science Asked by GGS on November 24, 2020

I have the following data set:
[Image: screenshot of the data set, not reproduced here]

I want to use the attributes Tags and Authors to classify each record into its respective Rating, and I want to use a random forest classifier to do so. My concern is how to deal with the Tags attribute. Each entry has an undetermined number of tags separated by commas. There are 4412 unique tags in total, and the entry with the most tags contains 20 of them. The first entry has the tags ["Rhode Island", "Economy", "Taxes", "Lincoln Chafee"].

How should I encode this attribute so that I can use the RandomForestClassifier from sklearn?

One Answer

  1. You should use sklearn's MultiLabelBinarizer, which creates one binary column per unique tag:

from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
lb.fit_transform([['A', 'B', 'C'], ['A', 'D', 'E', 'B']])

Output (the columns are the sorted classes A, B, C, D, E):

array([[1, 1, 1, 0, 0],
       [1, 1, 0, 1, 1]])

  2. If required, drop the columns whose sum (the number of records carrying that tag) falls below a threshold. This reduces the feature count by removing low-variance features.
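
Putting both steps together might look roughly like the sketch below. It is only an illustration under assumptions: the `raw_tags` strings, the `ratings` labels, and the `min_count` threshold are made up, and the Authors attribute is left out.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the Tags column: one comma-separated string per record.
# These rows and the labels below are hypothetical, not the asker's data.
raw_tags = [
    "Rhode Island,Economy,Taxes,Lincoln Chafee",
    "Economy,Taxes",
    "Economy,Health Care",
    "Taxes,Health Care,Economy",
]
ratings = ["false", "true", "true", "false"]

# Split each string into a list of tags, trimming stray whitespace.
tag_lists = [[t.strip() for t in row.split(",") if t.strip()] for row in raw_tags]

# Step 1: one binary column per unique tag.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(tag_lists)

# Step 2: keep only tags that occur in at least `min_count` records.
min_count = 2  # hypothetical threshold; tune it on your own data
keep = X.sum(axis=0) >= min_count
X_reduced = X[:, keep]
kept_tags = mlb.classes_[keep]

# Fit the random forest on the reduced binary matrix.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_reduced, ratings)
```

With 4412 unique tags the dense matrix is mostly zeros, so it may be worth passing `sparse_output=True` to MultiLabelBinarizer; RandomForestClassifier accepts sparse input. The same `mlb` and `keep` mask should be reused (via `mlb.transform`) on any test data so the columns line up.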

Correct answer by 10xAI on November 24, 2020
