Data Science Asked on March 11, 2021
If we have a column like:
Name
0 Alice
1 Bob
2 Dave
then, after numeric encoding, it becomes:
Name
0 0
1 1
2 2
What if, however, we have a column like this:
Names
0 Alice, Bob
1 Alice, Bob, Dave
2 Dave
One way to encode it would be like this:
Alice Bob Dave
0 1 1 0
1 1 1 1
2 0 0 1
However, this creates a lot of extra columns. Is there a way to encode such a column without ending up with loads of extra columns, the way numeric encoding avoids the extra columns that one-hot encoding would create for the first example I showed?
If you’re using Python, here’s some code to reproduce my DataFrame:
import pandas as pd
df = pd.DataFrame({'Names': ['Alice, Bob', 'Alice, Bob, Dave', 'Dave']})
Edit: the ultimate purpose of this is to then pass this through a tree-based classifier.
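For reference, the multi-hot table shown in the question can be produced directly in pandas; this is a sketch that assumes the names are separated by a comma and a space, as in the example:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Alice, Bob', 'Alice, Bob, Dave', 'Dave']})

# Split each cell on ', ' and build one indicator column per unique name
encoded = df['Names'].str.get_dummies(sep=', ')
print(encoded)
#    Alice  Bob  Dave
# 0      1    1     0
# 1      1    1     1
# 2      0    0     1
```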
I think it depends on how many unique values you have and what you want to do with this data next. The intermediate table seems like a good solution for now, but if you're worried about the overhead, it might suit you to create bit masks in a one-hot manner (binary encoding).
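A sketch of what such a bit-mask encoding might look like: each name gets a bit position, and every combination collapses to a single integer (the column name `NamesMask` is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Alice, Bob', 'Alice, Bob, Dave', 'Dave']})

# Assign each unique name one bit position
names = sorted({n for cell in df['Names'] for n in cell.split(', ')})
bit = {name: 1 << i for i, name in enumerate(names)}  # Alice=1, Bob=2, Dave=4

# Combine the bits of each row's names into one integer
df['NamesMask'] = df['Names'].apply(
    lambda cell: sum(bit[n] for n in cell.split(', ')))
print(df['NamesMask'].tolist())  # [3, 7, 4]
```

Note that a tree-based model can split on this integer, but the split thresholds won't respect bit boundaries, so the separate one-hot columns usually remain the safer choice.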
Answered by GrozaiL on March 11, 2021
First, do the OneHotEncoder thing with your Names column, which will give you three columns as you have stated above, namely Alice, Bob, and Dave.
We have this table after doing the one-hot encoding:
Alice Bob Dave
0 1 1 0
1 1 1 1
2 0 0 1
Now you can call the apply function on these columns; pass axis=1 to apply row-wise, convert the dtype to str, and then join the values (the astype function converts the input into the given data type):
df['AliceBobDave'] = df[['Alice', 'Bob', 'Dave']].apply(lambda row: ','.join(row.astype(int).astype(str)), axis=1)
The above will give you this output:
Alice Bob Dave AliceBobDave
0 1 1 0 1,1,0
1 1 1 1 1,1,1
2 0 0 1 0,0,1
Now run the command below to remove the columns not needed in the final dataframe. In this case we remove Alice, Bob, and Dave using pandas' drop function:
df = df.drop(['Alice','Bob','Dave'], axis = 1)
After executing this command, if we print df using print(df), we get:
AliceBobDave
0 1,1,0
1 1,1,1
2 0,0,1
This is your desired output.
Answered by Pranav Pandey on March 11, 2021
See my comments on @GrozaiL's post for my real thoughts on this (use some form of binary encoding, ideally one-hot columns), but if you're really dying for a method to assign values to subsets, why not a Gödel numbering? Assign each person a prime number and multiply them together; that'll give you one number for each subset, and if you ever want to figure out which people were in the subset after the fact, you can just use the prime factorization of the assigned number to recover its membership! (Don't do this)
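For the curious, a minimal sketch of this (not recommended) Gödel numbering, with hypothetical prime assignments for the example's names:

```python
# Hypothetical prime assignment, one prime per person
prime = {'Alice': 2, 'Bob': 3, 'Dave': 5}

def encode(names):
    # Multiply the members' primes: each subset maps to a unique integer
    code = 1
    for n in names:
        code *= prime[n]
    return code

def decode(code):
    # Recover membership from the prime factorization of the code
    return [n for n, p in prime.items() if code % p == 0]

print(encode(['Alice', 'Bob']))           # 6
print(decode(encode(['Alice', 'Dave'])))  # ['Alice', 'Dave']
```

Besides introducing spurious ordering, the codes grow quickly: the product of the first 30 primes already exceeds a 64-bit integer, which is one more reason not to do this.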
Answered by Matthew on March 11, 2021
The binary encoding seems to be the most natural, capturing exactly the present relationships and not introducing nonexistent relationships. The combinations of names then are thought of as vertices of a hypercube; naturally, if you collapse this down to one dimension, you will be introducing nonexistent relationships. Indeed, this is what happens with label encoding: Bob isn't really any closer to Alice than Dave is, but your label encoding makes that appear to be the case.
I've usually been happy enough with just the binary encoding, so I don't have much to back up the following, but some ideas:
For your case, you could still project the hypercube to one dimension; there are any number of ways to do it, and every one will add some additional information that shouldn't be there. You could experiment and try to find one that seems most suitable for your purposes.
That should sound reminiscent of word embedding, and suggests another method: use a neural network to generate "entity embeddings", a projection from your hypercube to some smaller-dimensional space. In this way, you uncover combinations of persons that are useful for your problem (maybe you end up with a feature that's "at least one of Alice and Dave", etc.). You could also do something more transparent, and do some feature engineering by hand to pick out interesting combinations and binning to keep the number of features manageable. For this you could consider target encoding each combination. (Indeed, that's a very easy solution by itself, though target encoding can easily lead to overfitting if you have very many people or some rare combinations of people.)
Finally, you might be able to split a row containing (Alice, Dave) into two, one with Alice and the other with Dave. You can now label-encode (or whatever else), and will need either a model that understands multiple rows per sample point, or a postprocessing method to deal with combining rows.
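The row-splitting idea above can be sketched with pandas' `explode` (available since pandas 0.25); the original index ties the split rows back to their sample:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Alice, Bob', 'Alice, Bob, Dave', 'Dave']})

# One row per (original sample, person); the index repeats per sample
long = df.assign(Name=df['Names'].str.split(', ')).explode('Name')

# Label-encode the now single-valued Name column
long['NameCode'] = long['Name'].astype('category').cat.codes
print(long[['Name', 'NameCode']])
```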
Answered by Ben Reiniger on March 11, 2021
To reduce the number of columns produced by one-hot encoding, one solution might be target encoding. That is, if you have the following table:
Alice Bob Dave Other_features Target
0 1 1 0 X1 0
1 1 1 0 X2 1
2 1 1 0 X3 0
3 1 1 1 X4 1
4 1 1 1 X5 1
5 1 1 1 X6 1
6 0 0 1 X7 0
7 0 0 1 X8 0
You can create a table:
Group Group_Avg_target
Alice,Bob 0.333
Alice, Bob, Dave 1
Dave 0
Then replace the values in your dataset:
Group_Avg_target Other_features Target
0 0.333 X1 0
1 0.333 X2 1
2 0.333 X3 0
3 1 X4 1
4 1 X5 1
5 1 X6 1
6 0 X7 0
7 0 X8 0
That way you'll end up with one column. This mainly works if you have enough instances to get meaningful averages per group of people. I would not recommend using it in general, as you lose a lot of the interactions between people, which might be bad for explaining what happened.
Note: to use this solution you need to make sure there is no 'target leak', i.e. use the averages calculated on the train set to encode your test dataset.
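A minimal sketch of that train-only target encoding: the group averages are fit on the training split and merely looked up on test, with the global train mean (a common, hypothetical choice here) as a fallback for combinations unseen in train:

```python
import pandas as pd

train = pd.DataFrame({
    'Names': ['Alice, Bob', 'Alice, Bob', 'Alice, Bob, Dave', 'Dave'],
    'Target': [0, 1, 1, 0],
})
test = pd.DataFrame({'Names': ['Alice, Bob', 'Bob, Dave']})

# Fit: average target per name combination, computed on train only
group_means = train.groupby('Names')['Target'].mean()
global_mean = train['Target'].mean()  # fallback for unseen combinations

# Transform both splits with the train-fitted means (no target leak)
train['Group_Avg_target'] = train['Names'].map(group_means)
test['Group_Avg_target'] = test['Names'].map(group_means).fillna(global_mean)
print(test['Group_Avg_target'].tolist())  # [0.5, 0.5]
```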
Answered by lcrmorin on March 11, 2021