Data Science Asked by user83221 on August 16, 2021
Let’s say I am trying to predict whether a car will be auctioned or not (not what I’m actually trying to do, but it represents it pretty well) using tabular data. I have the year the car was made, its color, model, etc. The model is the name of the car (e.g., Sportage, Mazda3, etc.). Some of the more popular models, such as Sportage, appear many times, whereas some of the less popular ones might appear only once or twice. In that case, what would be the ideal way to deal with this?
More info:
In my case, I have about 3000 different car models. The first two or three make up about 20% of my data, but the rest appear only once or twice in the entire dataset. I have tried one-hot encoding, and that did increase my score immensely, but it’s still not good enough (I know for a fact it could be better).
P.S.: I have already looked at the posts about high cardinality, and although I do think they are related to my problem, it’s still a different issue.
Thank you so much!
Since a few car models make up 20% of your data, you can build a similarity matrix between every car model and these 2-3 frequent models. In this matrix, each car model has 2-3 values describing its similarity to each of those 2-3 frequent models. You can then add these values as new features alongside your existing ones, which may improve classification of the underrepresented car models.
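The answer does not say how similarity between car models should be measured, so the sketch below makes an assumption: each car model is summarized by the standardized mean of its numeric columns, and similarity is the cosine similarity between those summaries. The function name `similarity_features` and the column names `model`, `year`, and `mileage` are hypothetical and only serve the illustration; any other similarity measure could be substituted.

```python
import numpy as np
import pandas as pd

def similarity_features(df, model_col, numeric_cols, k=3):
    """Return k 'similarity to frequent model' features, one row per input row."""
    # Anchors: the k most frequent car models.
    anchors = df[model_col].value_counts().index[:k]

    # Assumed model profile: mean of standardized numeric columns per car model.
    z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std(ddof=0)
    profiles = z.groupby(df[model_col]).mean()

    # Cosine similarity between every model profile and each anchor profile.
    norms = np.linalg.norm(profiles.values, axis=1) + 1e-12
    unit = profiles.div(norms, axis=0)
    sim = unit @ unit.loc[anchors].T
    sim.columns = [f"sim_to_{a}" for a in anchors]

    # Map the model-level similarities back onto the original rows.
    return df[[model_col]].join(sim, on=model_col).drop(columns=model_col)

# Toy data (made up for illustration only).
demo = pd.DataFrame({
    "model": ["Sportage"] * 4 + ["Mazda3"] * 3 + ["Rare1", "Rare2"],
    "year": [2015, 2016, 2017, 2018, 2008, 2009, 2010, 2019, 2001],
    "mileage": [40, 35, 30, 25, 120, 110, 100, 20, 200],
})
feats = similarity_features(demo, "model", ["year", "mileage"], k=2)
augmented = pd.concat([demo, feats], axis=1)
```

A model that appears only once still gets a similarity value here, but its profile rests on a single row, so those values are noisy.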
You can experiment with the number of car models that each car model is compared to. Here, these 2-3 car models cover 20% of your data. It might be that 30 car models cover, say, 40% of the data, which is still useful given that there are 3000 car models.
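To choose how many frequent car models to compare against, it can help to check how much of the data the top models actually cover. A minimal sketch with pandas follows; the helper name `models_needed_for_coverage` and the toy series are made up for illustration.

```python
import pandas as pd

def models_needed_for_coverage(models: pd.Series, target: float) -> int:
    """Smallest number of most-frequent models whose rows cover `target` of the data."""
    # value_counts sorts by descending frequency; normalize=True gives row shares.
    share = models.value_counts(normalize=True)
    cumulative = share.cumsum()
    # Count the models that fall short of the target, then add the one that reaches it.
    needed = int((cumulative < target).sum()) + 1
    return min(needed, len(share))

# Toy data: shares are 40%, 30%, 20%, 10%.
toy = pd.Series(["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"])
n_half = models_needed_for_coverage(toy, 0.5)
```

Running this over a range of targets shows where adding more reference models stops buying much extra coverage.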
Answered by user1825567 on August 16, 2021