Data Science Asked by MMMMMay on August 9, 2020
I’m working on a dataset with more than 2000 features. Most of the features contain both numerical and categorical values. For example, in the feature that represents how long the user has been living at the current address, a value can be a number or a letter code meaning that the value could not be obtained for some reason.
I don’t know how to process these features. If they were purely numerical or purely categorical, things would be much easier. But since they are mixed, I’m really confused. Could anyone give me some advice?
Update: I may not have expressed this clearly. This is not a dataset that includes both numerical features and categorical features. I mean that within a single feature there are both numerical values and categorical values.
For example (here M, C and T mean that, for different reasons, no exact value could be found):
TOTAL INCOME
3000
5000
M
8000
C
4000
T
Ideally, you would have domain knowledge and understand how this feature affects the target, and use that to decide the best way to encode it:
First way: treat all the values as a single categorical feature with high cardinality.
Second way: split the feature into two columns, one categorical and one numerical, and then treat them separately.
Without domain knowledge, I can't think of anything better in this case. Answering these questions would help: Is the number really a number? Why are there numbers and strings in the same column? How might this information help my model?
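A minimal sketch of the two options above with pandas, using the toy TOTAL INCOME column from the question. The column names income_num and income_code and the placeholder label "NUMERIC" are made up for illustration, not part of the original answer.

```python
import pandas as pd

# Toy column copied from the question; M, C, T are the "no value" codes.
df = pd.DataFrame({"TOTAL INCOME": ["3000", "5000", "M", "8000", "C", "4000", "T"]})
raw = df["TOTAL INCOME"].astype(str).str.strip()

# First way: every distinct value (numbers included) becomes a category level.
as_category = raw.astype("category")

# Second way, numerical part: anything that does not parse becomes NaN.
df["income_num"] = pd.to_numeric(raw, errors="coerce")

# Second way, categorical part: keep the letter code where parsing failed,
# and use a placeholder label (hypothetical) where a real number is present.
df["income_code"] = raw.where(df["income_num"].isna(), "NUMERIC")

# The categorical part can then be one-hot encoded like any other category.
encoded = pd.get_dummies(df[["income_num", "income_code"]], columns=["income_code"])
print(encoded)
```

The numerical column still contains NaN for the coded rows, so it needs either imputation or a model that accepts missing values.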
Answered by Carlos Mougan on August 9, 2020
First, you should decide whether you have any clue about what these character codes mean.
Let's say the answer is "No" (which I doubt).
Then this is no different from a missing record, so you may try the best available imputation technique.
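As a rough sketch of that case, assuming pandas: the codes are coerced to NaN and then imputed. Median imputation and the extra was_missing flag are illustrative choices here, not something the answer prescribes.

```python
import pandas as pd

income = pd.Series(["3000", "5000", "M", "8000", "C", "4000", "T"], name="TOTAL INCOME")

# Letter codes fail to parse and become NaN, i.e. ordinary missing values.
income_num = pd.to_numeric(income, errors="coerce")

# Optional flag so a model can still see that the value was absent.
was_missing = income_num.isna().astype(int)

# Median imputation is only a placeholder for "the best possible technique".
income_imputed = income_num.fillna(income_num.median())
print(income_imputed.tolist())
```

In a real pipeline the median would be computed on the training split only and reused for validation and test data.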
If you do have a clue from domain knowledge, e.g. C means the value is missing for the population of California, then you may apply logic that fits that meaning.
Try plotting the other features and the target against these values and look for patterns that give you a clue.
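A tabular stand-in for such a plot, sketched with pandas. The target column and all of its values are invented for illustration; only the TOTAL INCOME codes come from the question.

```python
import pandas as pd

# Hypothetical data: the target values are made up for illustration.
df = pd.DataFrame({
    "TOTAL INCOME": ["3000", "5000", "M", "8000", "C", "4000", "T", "M", "C", "6000"],
    "target":       [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

num = pd.to_numeric(df["TOTAL INCOME"], errors="coerce")

# Label each row by its code, or "NUMERIC" when a real number is present.
df["status"] = df["TOTAL INCOME"].where(num.isna(), "NUMERIC")

# Row count and mean target per status; large gaps between codes would hint
# that the reason for the missing value carries signal.
summary = df.groupby("status")["target"].agg(["count", "mean"])
print(summary)
```

If the codes show clearly different target rates, treating them all as one generic "missing" value would throw that information away.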
Last, you may try to encode these values using target-based encoding or other techniques that keep them in one dimension only. Ref-I Ref-II
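One possible reading of this, sketched with pandas: bin the real numbers, treat bins and letter codes as one categorical column, and replace each category with a smoothed mean of the target, which gives a single numeric column. The target values, the two-bin split, and the smoothing strength m are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: the target values are made up for illustration.
df = pd.DataFrame({
    "TOTAL INCOME": ["3000", "5000", "M", "8000", "C", "4000", "T", "M", "C", "6000"],
    "target":       [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})

num = pd.to_numeric(df["TOTAL INCOME"], errors="coerce")

# Bin the real numbers into two quantile bins (the bin count is arbitrary).
bins = pd.qcut(num, q=2, labels=["low", "high"]).astype(object)

# One categorical column: a bin label for numbers, the letter code otherwise.
df["income_cat"] = bins.where(num.notna(), df["TOTAL INCOME"])

# Smoothed target encoding: shrink each category mean toward the global mean.
prior = df["target"].mean()
stats = df.groupby("income_cat")["target"].agg(["sum", "count"])
m = 2  # smoothing strength, an assumed value
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

df["income_te"] = df["income_cat"].map(encoding)
print(df[["TOTAL INCOME", "income_cat", "income_te"]])
```

To avoid target leakage, the encoding table should be fitted on training folds only and then applied to held-out rows.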
Answered by 10xAI on August 9, 2020