Data Science Asked on March 16, 2021
Suppose I have a dataset of shape (10000, 45). One of the features is activity_type, whose values range from 1 to 15, as shown below:
df = pd.read_csv('actTrain.csv')
df['activity_type'].head()
The output of the above code is:
0 1
1 1
2 2
3 1
4 3
Name: activity_type, dtype: int64
Will encoding activity_type with OneHotEncoder from sklearn improve the model in any way? Is it necessary to encode that feature? And if so, which one should I choose: LabelEncoder or OneHotEncoder?
LabelEncoder maps labels (typically strings) to integers, but you already have integers, so LabelEncoder will not help you here.
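A minimal sketch of this point, assuming scikit-learn is installed and using the five values from the question's head() output as sample data. LabelEncoder only renumbers the integers starting from 0, so no new information is added:

```python
from sklearn.preprocessing import LabelEncoder

# Sample values copied from the head() output in the question
values = [1, 1, 2, 1, 3]

le = LabelEncoder()
relabelled = le.fit_transform(values)

print(le.classes_)   # the distinct original values, sorted
print(relabelled)    # the same column, renumbered from 0
```

The result is still a single integer column with the same ordering as before.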
When you use your integer column as it is, sklearn treats it as numeric. This means, for example, that the distance between 1 and 2 is 1, and the distance between 1 and 4 is 3. Can you say the same about your activities (if you know the meaning of the integers)? What are the pairwise distances between, for example, "exercise", "work", "rest", and "leisure"?
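To make the distances concrete, here is a small sketch using NumPy. The codes 1 to 4 are only placeholders, since the question does not say what each integer means:

```python
import numpy as np

# Placeholder activity codes (their real meaning is not given in the question)
codes = np.array([1, 2, 3, 4])

# Pairwise absolute differences: what a distance-based model sees
# when the column is left as a plain number
raw_dist = np.abs(codes[:, None] - codes[None, :])

print(raw_dist)
```

Row 0 of the matrix shows that code 1 is treated as three times further from code 4 than from code 2, whether or not that reflects the activities.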
If you think that the distance between any pair of activities should be the same, because they are simply different activities, then OneHotEncoder is your choice.
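A hedged sketch of what that looks like, again assuming scikit-learn is installed and using the five sample values from the question rather than the full dataset. Note that the Euclidean distance between two distinct one-hot vectors is sqrt(2) rather than exactly 1; what matters is that it is the same for every pair:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Sample values from the question's head() output, as a single-column 2D array
X = np.array([[1], [1], [2], [1], [3]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # sparse result converted to dense

print(encoder.categories_)
print(encoded)

# Pairwise Euclidean distances between the one-hot vectors of each category
cats = encoder.transform(np.array([[1], [2], [3]])).toarray()
dists = np.linalg.norm(cats[:, None, :] - cats[None, :, :], axis=-1)
print(dists)
```

With all 15 activity types present, this would produce 15 binary columns in place of the single activity_type column.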
Correct answer by lanenok on March 16, 2021