Data Science Asked by dooder on May 27, 2021
I have a question regarding how to set up a dataset for modeling.
Let’s say I have a dataset representing which car a person will buy depending on some characteristics:
The dependent variables are individual cars (Car 1, Car 2, … Car 100).
The independent variables are:
Budget (of the buyer)
Favorite Color (of buyer)
…..
…..
Color (of Car 1)
Color (of Car 2)
….
Color (of Car 100)
MPG (of Car 1)
MPG (of Car 2)
…..
MPG (of Car 100)
Let’s assume this is a multi-class classification problem. So, only one of the cars can be chosen in each situation.
My question is: is it appropriate to have independent variables like that, ones that are specific to each of the dependent variables (Color of Car X, MPG of Car X, …)? Is it appropriate to just fit a row like that into a model? How does the model know that each of the Color columns describes the same feature, Color?
Lastly, is there a name for this type of data/problem? I’m not sure how to look for it on Google.
Color is a categorical feature.
One of the most common methods to encode categorical features is one-hot encoding. Color could be encoded as an indicator vector. The color of the current car would have a 1 at the appropriate index. For example, [1, 0, 0, …, 0] for a red car and [0, 1, 0, …, 0] for a blue car.
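As a minimal sketch of the one-hot encoding described above, the snippet below builds the indicator vector by hand. The `COLORS` vocabulary and the `one_hot` helper are hypothetical names chosen for illustration; in practice a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would usually do this for you.

```python
def one_hot(value, categories):
    """Return an indicator vector with a 1 at the index of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical colour vocabulary (an assumption, not from the question).
# The order of this list fixes which index each colour occupies.
COLORS = ["red", "blue", "green", "black"]

red_car = one_hot("red", COLORS)    # [1, 0, 0, 0]
blue_car = one_hot("blue", COLORS)  # [0, 1, 0, 0]
```

Because every colour column is encoded against the same vocabulary, index 0 means "red" for Car 1, Car 2, and so on, although the model itself still sees each car's block of columns as separate inputs.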
There are other options for encoding categorical features, such as binary, count, hash, or label encoding.
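The alternatives listed above can be sketched in plain Python as follows. This is only an illustration under assumptions: the `colors` column is made-up data, `hash_bucket` and the bucket count of 8 are hypothetical choices, and real projects would more likely rely on an encoding library than on hand-written code.

```python
from collections import Counter
import hashlib

# Made-up example column (an assumption, not data from the question).
colors = ["red", "blue", "red", "green", "red", "blue"]

# Label encoding: each distinct category gets an integer id.
# Sorting makes the assignment deterministic: blue=0, green=1, red=2.
label_map = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [label_map[c] for c in colors]

# Count encoding: replace each category with how often it occurs.
counts = Counter(colors)
count_encoded = [counts[c] for c in colors]

# Binary encoding: write the label id as a fixed-width list of bits.
n_bits = max(1, (len(label_map) - 1).bit_length())
binary_encoded = [
    [int(bit) for bit in format(label_map[c], f"0{n_bits}b")] for c in colors
]


# Hash encoding: map each category to one of n_buckets via a stable hash.
# Distinct categories may collide in the same bucket.
def hash_bucket(value, n_buckets=8):
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


hash_encoded = [hash_bucket(c) for c in colors]
```

Note that label encoding imposes an arbitrary order on the categories, which some models will treat as meaningful; one-hot encoding avoids that at the cost of more columns.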
Answered by Brian Spiering on May 27, 2021