Equivalent of numeric encoding when rows can contain multiple values

Question

If we have a column like:

Name
0  Alice
1    Bob
2   Dave

then, after numeric encoding, it becomes:

Name
0  0
1  1
2  2

What if, however, we have a column like this:

Names
0        Alice, Bob
1  Alice, Bob, Dave
2              Dave

One way to encode it would be like this:

Alice  Bob  Dave
0      1    1     0
1      1    1     1
2      0    0     1

However, this creates a lot of extra columns. Is there a way to encode such a column in a way that doesn't end up with loads of extra columns, just like using numeric encoding instead of one-hot-encoding prevents too many columns from appearing in a column like the first one I showed?

If you're using Python, here's some code to reproduce my DataFrame:

import pandas as pd
df = pd.DataFrame({'Names': ['Alice, Bob', 'Alice, Bob, Dave', 'Dave']})

Edit: the ultimate purpose of this is to then pass this through a tree-based classifier.

GrozaiL · Answer

I think, it depends on how many unique values do you have and what do you want to do with this data next. Intermediate table seems good solution for now, but you're expecting overhead. So maybe it would be suitable for you to create bit masks in one-hot manner (Binary encoding).

Answered by GrozaiL on March 11, 2021

Pranav Pandey · Answer

First, do the OneHotEncoder thing with your Name column which will give you three columns as you have stated above, Namely Alice, Bob, and Dave.

We have this Table after doing OneHotEncoding :

Alice  Bob  Dave
0   1      1    0     
1   1      1    1   
2   0      0    1

Now you can call apply function on these columns; pass axis=1 to use the apply function row-wise, then convert the dtype to str and then join them as follows (astype function converts the input into the give data-type):

df['AliceBobDave']= df[df.columns[0:]].apply(lambda x: ','.join(x.dropna().astype(int).astype(str)),axis=1)

The Above will give you output :

Alice  Bob  Dave  AliceBobDave
0    1      1    0     1,1,0
1    1      1    1     1,1,1
2    0      0    1     0,0,1

Now run this command below to remove the columns not needed in the final dataframe. For ex. in this case we remove Alice, Bob, and Dave using the drop function in pandas:

df = df.drop(['Alice','Bob','Dave'], axis = 1)

After executing this command if we print df using print(df), we get:

AliceBobDave
0   1,1,0
1   1,1,1
2   0,0,1

This is your desired output.

Matthew · Answer

See my comments on @GrozaiL's post for my real thoughts on this (use some form of binary encoding, ideally one-hot columns), but if you're really dying for a method to assign values to subsets, why not a Godel Numbering? Assign each person a prime number and multiply them together, that'll give you one number for each subset and if you ever want to figure out which people were in the subset after the fact, you can just use the prime factorization of the assigned number to recover its membership! (Don't do this)

Answered by Matthew on March 11, 2021

Ben Reiniger · Answer

The binary encoding seems to be the most natural, capturing exactly the present relationships and not introducing nonexistent relationships.  The combinations of names then are thought of as vertices of a hypercube; naturally, if you collapse this down to one dimension, you will be introducing nonexistent relationships.  Indeed, this is what happens with label encoding: Bob isn't really any closer to Alice than Dave is, but your label encoding makes that appear to be the case.

I've usually been happy enough with just the binary encoding, so I don't have much to back up the following, but some ideas:

For your case, you could still project the hypercube to one dimension; there are any number of ways to do it, and every one will add some additional information that shouldn't be there.  You could experiment and try to find one that seems most suitable for your purposes.

That should sound reminiscent of word embedding, and suggests another method: use a neural network to generate "entity embeddings", a projection from your hypercube to some smaller-dimensional space.  In this way, you uncover combinations of persons that are useful for your problem (maybe you end up with a feature that's "at least one of Alice and Dave", etc.).  You could also do something more transparent, and do some feature engineering by hand to pick out interesting combinations and binning to keep the number of features manageable.  For this you could consider target encoding each combination.  (Indeed, that's a very easy solution by itself, though target encoding can easily lead to overfitting if you have very many people or some rare combinations of people.)

Finally, you might be able to split a row containing (Alice, Dave) into two, one with Alice and the other with Dave.  You can now label-encode (or whatever else), and will need either a model that understands multiple rows per sample point, or a postprocessing method to deal with combining rows.

lcrmorin · Answer

To reduce the number of columns produced by onehot encoding one solution might be target encoding. That is, if you have the following table :
   Alice  Bob  Dave Other_features Target
0      1    1     0             X1      0
1      1    1     0             X2      1
2      1    1     0             X3      0
3      1    1     1             X4      1
4      1    1     1             X5      1
5      1    1     1             X6      1
6      0    0     1             X7      0
7      0    0     1             X8      0

You can create a table :
Group               Group_Avg_target
Alice,Bob                      0.333
Alice, Bob, Dave                   1
Dave                               0

Then replace the values in your dataset :
   Group_Avg_target Other_features Target
0             0.333             X1      0
1             0.333             X2      1
2             0.333             X3      0
3                 1             X4      1
4                 1             X5      1
5                 1             X6      1
6                 0             X7      0
7                 0             X8      0

That way you'll end up with one column. That mainly works if you have enough instances to have meaningfull averages by groups of people. I would not recommand using that in general as you would lose a lot of interactions betweeen people, which might be bad to explain what happened.
Note : to use this solution you need to make sure that there is no 'target leak', ie use the average calculated on train to encode your test dataset.

Equivalent of numeric encoding when rows can contain multiple values

5 Answers

Add your own answers!

Ask a Question