How to binary encode multi-valued categorical variable from Pandas dataframe?

Question

Suppose we have the following DataFrame, in which one column holds several values per row:

categories
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]

How can we get a table like this?

    "A"  "B"  "C"  "D"
0 -  1    1    0    0
1 -  0    1    1    1
2 -  0    1    0    1

Note: I don't necessarily need a new DataFrame; I'm wondering how to transform such DataFrames into a format more suitable for machine learning.

Samuel Harrold · Accepted Answer

If [0, 1, 2] are numerical labels and not the index, then pandas.DataFrame.pivot_table works:

In []:
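import pandas as pd

# Long format: one row per (number_label, category) pair.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])
# Count the rows for each (number_label, category) pair; absent pairs become 0.
data.pivot_table(index=['number_label'], columns=['category'], aggfunc=[len], fill_value=0)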

Out[]:
              len
category      A      B      C      D
number_label                       
0             1      1      0      0
1             0      1      1      1
2             0      1      0      1

This blog post was helpful.
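
For the same long-format data, pandas.crosstab is another way to get the count table. This is only a sketch, not part of the original answer; the names counts and indicators are made up for illustration, and clipping to 1 assumes you want 0/1 indicators rather than counts when a pair repeats.

```python
import pandas as pd

# Same long-format data as above: one row per (number_label, category) pair.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])

# crosstab counts how often each (number_label, category) pair occurs.
counts = pd.crosstab(data['number_label'], data['category'])

# Cap the counts at 1 so a repeated pair still yields a 0/1 indicator.
indicators = counts.clip(upper=1)
print(indicators)
```

Unlike the pivot_table call, this does not depend on how pivot_table handles a frame that has no value columns left after the index and column keys are taken out.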

If [0, 1, 2] is the index, then collections.Counter is useful:

In []:
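import collections
import pandas as pd
data2 = pd.DataFrame.from_dict(
    {'categories': {0: ['A', 'B'], 1: ['B', 'C', 'D'], 2: ['B', 'D']}})
# Count the categories in each row's list.
data3 = data2['categories'].apply(collections.Counter)
# One column per category; categories missing from a row become 0.
pd.DataFrame.from_records(data3.tolist(), index=data3.index).fillna(value=0).astype(int)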

Out[]:
       A      B      C      D
0      1      1      0      0
1      0      1      1      1
2      0      1      0      1
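
If only 0/1 indicators are needed (not counts), the list column can also be encoded with the pandas string methods. A minimal sketch, not from the original answer: it assumes every category is a string and that none of them contains the '|' character, which str.get_dummies uses as its default separator. The name indicators is illustrative.

```python
import pandas as pd

data2 = pd.DataFrame.from_dict(
    {'categories': {0: ['A', 'B'], 1: ['B', 'C', 'D'], 2: ['B', 'D']}})

# Join each row's list into one '|'-separated string, e.g. 'B|C|D',
# then split it back into one 0/1 indicator column per distinct value.
indicators = data2['categories'].str.join('|').str.get_dummies()
print(indicators)
```

The result keeps the original index, so it can be joined back onto data2 if the other columns are needed alongside the indicators.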
