Encoding features for multi-class classification

Question

I have a question regarding how to set up a dataset for modeling.
Let’s say I have a dataset representing which car a person will buy depending on some characteristics:
The dependent variables are individual cars (Car 1, Car 2, … Car 100).
The independent variables are:
Budget (of the buyer)
Favorite Color (of buyer)
…..
…..
Color (of Car 1)
Color (of Car 2)
….
Color (of Car 100)
MPG (of Car 1)
MPG (of Car 2)
…..
MPG (of Car 100)
Let’s assume this is a multi-class classification problem. So, only one of the cars can be chosen in each situation.
My question is: is it appropriate to have independent variables that are specific to each of the dependent variables (Color of Car X, MPG of Car X, …)? Is it appropriate to just feed a row like that into a model? How does the model know that each of the Color columns describes the same feature, Color?
Lastly, is there a name for this type of data/problem? I'm not sure how to look for it on Google.

Brian Spiering · Answer

Color is a categorical feature.
One of the most common methods to encode categorical features is one-hot encoding. Color could be encoded as an indicator vector. The color of the current car would have a 1 at the appropriate index. For example, [1, 0, 0, …, 0] for a red car and [0, 1, 0, …, 0] for a blue car.
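As a minimal sketch of the indicator-vector idea, the snippet below one-hot encodes a color in plain Python. The three-color vocabulary `COLORS` and the helper `one_hot` are hypothetical names chosen for illustration; the index assigned to each color is arbitrary, as long as it stays fixed.

```python
# Hypothetical color vocabulary (an assumption for illustration).
# The position of each color defines its index in the indicator vector.
COLORS = ["red", "blue", "green"]


def one_hot(color, vocabulary=COLORS):
    """Return an indicator vector with a 1 at the index of `color`."""
    if color not in vocabulary:
        raise ValueError(f"unknown color: {color!r}")
    return [1 if c == color else 0 for c in vocabulary]


red_vec = one_hot("red")    # [1, 0, 0]
blue_vec = one_hot("blue")  # [0, 1, 0]
```

In practice a library encoder would usually replace the hand-written helper, but the resulting vectors have the same shape: one position per known color.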
There are other options for encoding categorical features such as binary, count, hash, or label encoding.
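Two of these alternatives, label and count encoding, can be sketched in plain Python as follows. The sample `colors` list is made up for illustration, and sorting the categories before assigning ids is just one possible convention; binary and hash encoding are not shown here.

```python
from collections import Counter

# Made-up sample column of car colors (an assumption for illustration).
colors = ["red", "blue", "red", "green", "red", "blue"]

# Label encoding: map each category to an integer id.
# Sorting the categories makes the mapping deterministic.
label_map = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [label_map[c] for c in colors]

# Count encoding: replace each category with how often it occurs.
counts = Counter(colors)
count_encoded = [counts[c] for c in colors]
```

Note that label encoding imposes an artificial order on the categories (here blue < green < red), which some models may interpret as meaningful.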
