Data Science Asked by Praveenks on December 6, 2020
I am trying to preprocess my dataset and need some suggestions.
The shape of the training data is (166573, 14).
The distribution of the features: only the first 4 columns take on a range of larger values; the remaining columns are binary (max: 1, min: 0).
**Null handling:**
I have dropped the claims_daysaway column, since most of its values are NULL, and replaced the NaN values in tier with the column mean.
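A minimal sketch of those two steps in pandas, using a toy frame in place of the real data (the column names come from the post; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data.
X = pd.DataFrame({
    "claims_daysaway": [np.nan, np.nan, np.nan, 12.0],  # mostly NULL
    "tier": [1.0, 2.0, np.nan, 3.0],
    "distance": [0.5, 1.2, 3.4, 2.2],
})

# Drop the mostly-NULL column entirely.
X = X.drop(columns=["claims_daysaway"])

# Impute tier's NaNs with the mean of the non-null values.
X["tier"] = X["tier"].fillna(X["tier"].mean())
```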
**Scaling features:**
I have scaled features where max value varies and left others untouched.
```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Scale only the columns whose max values vary; binary columns are left untouched.
X['scaled_distance'] = sc.fit_transform(X['distance'].values.reshape(-1, 1))
X['scaled_visit_count'] = sc.fit_transform(X['visit_count'].values.reshape(-1, 1))
X['scaled_tier'] = sc.fit_transform(X['tier'].values.reshape(-1, 1))
```
Is this the right approach, or should I scale all features?
It depends on what your goal is with the data.
If you are trying to model your variable "outcome" as a classification problem (which I am assuming), then it depends on the modelling technique you plan to use.
Some techniques benefit from scaled inputs because their functional form lets them perform better when all variables share the same scale (the canonical example is neural networks).
Other techniques do not need scaled inputs, mostly because their functional form does not depend on the relative magnitudes of the variables, as with tree-based models such as random forests.
There are also techniques, like logistic regression, where scaling is optional: they work fine without it and the result is the same, but sometimes you need to compare the resulting $\beta$ coefficients, and the only way to do that is if the $X$s are on the same scale.
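To illustrate that last point, here is a small sketch on synthetic data: two features with very different raw scales but equal true influence on the label. After standardizing, the fitted logistic-regression coefficients land on comparable footing (all names and data here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales...
X = np.column_stack([
    rng.normal(0, 1000.0, 500),  # large-scale feature
    rng.normal(0, 0.01, 500),    # small-scale feature
])
# ...but with equal true influence on the outcome.
y = (X[:, 0] / 1000.0 + X[:, 1] / 0.01 > 0).astype(int)

# Standardize so both features have mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X)
betas = LogisticRegression().fit(X_std, y).coef_[0]
# On the standardized scale the two coefficients are directly comparable
# as per-standard-deviation effects.
```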
The process you are following is fine, but make sure you actually need it.
Answered by Juan Esteban de la Calle on December 6, 2020