Data Science Asked on July 29, 2021
I’m currently working with a dataset of 55k records and seven columns (one of which is the target variable). Three of the features are nominal categorical, and the other three are ‘description’ fields with high cardinality, as is common for description data:
in>>
df[['size Description', 'weight Description', 'height Description']].nunique()
out>>
size Description      4066
weight Description     736
height Description    3173
dtype: int64
Some examples of these values:
Product               Product Description
-------------------   --------------------------------------
Ball                  Round bouncy toy for kids
Bat                   Stick that kids use to hit a ball
Go-Kart red/black     Small motorized vehicle for kids
Go-Kart blue/green    Small motorized vehicle for kids
Wrench                Tool for tightening or loosening bolts
Ratchet               Tool for tightening or loosening bolts
Reclining arm-chair   Cushioned seat for lounging
I believe the descriptions are standardized within a given category, but at this time I cannot confirm whether the number of unique descriptions is finite. For now, my assumption is to treat these as nominal categorical features, since they are literally descriptive rather than quantitative.
To that end, my question is: what are some best practices for handling categorical features such as these?
Things I have considered:
Label encoding is obviously not viable in this situation, as the descriptions have no inherent order or hierarchy.
One-hot encoding seems an unlikely solution, as it balloons the shape of the dataset from (55300, 6) to (55300, 65223) due to the high cardinality of these variables. I tried it anyway and got 98% accuracy on my test set but very poor results on an out-of-sample validation set (5k records, 0-5% accuracy). It seems pretty clear the model is over-fitting, so this is not viable.
Hashing, for whatever reason, will not apply to one of the columns, but I suppose it could still be viable; I just need to figure out why it isn’t hashing all of my features (probably best suited to a separate question). A sketch of what I have in mind appears after this list.
PCA – could be viable, but if I understand correctly, the dimensionality after one-hot encoding is too great and PCA will throw an error (or run out of memory). In fairness, I have not tried this yet.
Binning doesn’t seem feasible, since I could have a value of "3.5" or "three and 1/2"; each would be treated as a separate bin, so it doesn’t solve my problem.
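For reference, here is a sketch of the hashing approach I have in mind, using scikit-learn's FeatureHasher (the column name and n_features value are illustrative, not from my actual code):

from sklearn.feature_extraction import FeatureHasher

# Each row is passed as a single-element list of strings; the hasher
# maps it into a fixed-width sparse feature space, independent of cardinality.
hasher = FeatureHasher(n_features=1024, input_type='string')
X_hashed = hasher.transform([[v] for v in df['size Description'].astype(str)])

print(X_hashed.shape)  # (55300, 1024), no matter how many unique descriptions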
Thanks to all who can share their insight or opinions.
You have textual descriptions, i.e. unstructured data, so you should probably use one of the standard representation methods for text. There are many options, including sentence embeddings and other such advanced methods, but I'm going to describe the simple traditional option: a bag-of-words representation, where each description becomes a sparse vector of word counts or weights, so descriptions that share vocabulary end up with similar vectors.
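For instance, a minimal sketch using scikit-learn's TfidfVectorizer, a weighted bag of words (the column name and parameters are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# One sparse row per description; the vocabulary is learned from the data.
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_desc = vectorizer.fit_transform(df['size Description'].astype(str))

print(X_desc.shape)  # (n_rows, vocabulary_size), far smaller than full one-hot

Unlike one-hot encoding whole descriptions, descriptions that share words now share features, so near-duplicates such as the two Go-Kart rows map to almost identical vectors.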
[ obsolete answer to the first version of the question ]
I would definitely try to normalize these values, because semantically they are numerical, and in their original form they are almost useless no matter how you categorize them. Making them categorical loses a lot of information, especially for the values which actually contain a number. Since some of the strings used are very vague, I would try using intervals, i.e. two numeric values (a minimum and a maximum) for every original input value.
Normally there are not that many ways to express numbers as words, so a few patterns should be enough to capture all the variants. For the other, non-standard cases such as "kinda short", I would start by looking at their distribution in the data: if a value is frequent enough (e.g. probably "short", "tall"), then manually predefine a range for it. Values which are not frequent can be ignored, e.g. replaced with NA (since they are infrequent, that shouldn't affect the data too much).
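For example, a minimal sketch of this kind of normalization (the word patterns and the predefined ranges are illustrative assumptions, not values taken from the data):

import re
import numpy as np

# Spelled-out numbers and manually predefined ranges for frequent vague
# terms; the actual values would come from inspecting the data.
WORD_NUMBERS = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5'}
VAGUE_RANGES = {'short': (3.0, 5.0), 'tall': (6.0, 7.5)}

def to_interval(text):
    """Map a free-text value to a (min, max) numeric interval."""
    text = str(text).lower().strip()
    for word, digit in WORD_NUMBERS.items():
        text = re.sub(rf'\b{word}\b', digit, text)
    # Matches "3", "3.5", "3 and 1/2", etc.
    m = re.search(r'(\d+(?:\.\d+)?)(\s*(?:and\s*)?1/2)?', text)
    if m:
        value = float(m.group(1)) + (0.5 if m.group(2) else 0.0)
        return (value, value)
    if text in VAGUE_RANGES:
        return VAGUE_RANGES[text]
    return (np.nan, np.nan)  # infrequent / unparseable -> NA

print(to_interval('3.5'))            # (3.5, 3.5)
print(to_interval('three and 1/2'))  # (3.5, 3.5)
print(to_interval('short'))          # (3.0, 5.0)

Each description then yields two numeric columns (min and max) instead of one high-cardinality categorical.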
Answered by Erwan on July 29, 2021
Your considerations regarding label encoding, one-hot encoding, and the like are fairly accurate. For dealing with descriptive data, converting each description into a meaningful vector surely helps.
Meaningful vectors are vectors that catch the essence of what is being said in the descriptions.
Let,
football = "round bouncy toy for kids to kick"
basketball = "round bouncy toy for kids to dribble"
spark_plug = "device for delivering electric current from an ignition system to the combustion chamber of a spark-ignition engine"
Meaningful vectors will have properties such that,
dist(vector(football), vector(basketball)) < dist(vector(football), vector(spark_plug))
For vector representations learned on common English corpora, properties like the following emerge:
vector('king') - vector('male') + vector('female') ≈ vector('queen')
Word2Vec is an efficient way to calculate word vectors - vectors that capture the meaning of individual words. To use Word2Vec, you can either train a model on your own corpus or use vectors pre-trained on a large general corpus.
For your descriptions, you can average the vectors of each description's constituent words, thereby treating the description as a bag of words, or you can use more sophisticated techniques like Doc2Vec, which is based on Word2Vec but also tries to capture the relative order of words.
Gensim's implementations of Word2Vec and Doc2Vec expose a fairly simple API which you can quickly learn to use for your particular task.
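For example, a minimal sketch with Gensim (the 4.x API is assumed; the toy corpus, vector sizes, and epochs are illustrative):

import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

descriptions = [
    "round bouncy toy for kids to kick",
    "round bouncy toy for kids to dribble",
    "small motorized vehicle for kids",
]
tokenized = [d.lower().split() for d in descriptions]

# Option 1: train Word2Vec, then average the word vectors per description
# (the bag-of-words treatment described above).
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=50)

def avg_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

doc_vectors = np.vstack([avg_vector(t, w2v) for t in tokenized])

# Option 2: Doc2Vec, which learns one vector per description directly
# and also tries to capture word order/context.
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokenized)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=50)
new_vec = d2v.infer_vector("round toy for kids".split())

Either set of vectors can then replace the raw description columns as model input.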
Answered by Yash Jakhotiya on July 29, 2021