
Is it good practice to transform some variables and not others?

Data Science Asked by JMarcos87 on July 6, 2021

I have a dataset with categorical variables encoded as numeric values, some continuous variables with many outliers, and other continuous variables with fairly normal distributions.

I was planning to use sklearn's preprocessing.PowerTransformer to transform all of them, but would it make more sense to apply it only to the columns whose distributions are far from normal and that contain many outliers?

It’s for a classification problem (the Titanic machine learning one).

One Answer

Regarding whether to scale only a subset of features, I would apply the transformation to all of them (at least the continuous numeric ones), since the goal of data scaling is to put the features on the same reference scale so they can be fairly compared.
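As a minimal sketch of that idea (using hypothetical Titanic-style column names, since the question does not list them), sklearn's ColumnTransformer can apply the PowerTransformer to the continuous numeric columns while passing the already-encoded categorical columns through unchanged:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

# Hypothetical Titanic-style data for illustration only.
X = pd.DataFrame({
    "Age":    [22.0, 38.0, 26.0, 35.0],
    "Fare":   [7.25, 71.28, 7.92, 53.10],
    "Sex":    [0, 1, 1, 0],    # categorical, already encoded
    "Pclass": [3, 1, 3, 1],    # categorical, already encoded
})

preprocessor = ColumnTransformer(
    transformers=[
        # Yeo-Johnson (the default method) also handles zero and negative values.
        ("power", PowerTransformer(method="yeo-johnson"), ["Age", "Fare"]),
    ],
    remainder="passthrough",  # leave the encoded categorical columns untouched
)

X_transformed = preprocessor.fit_transform(X)
print(X_transformed)
```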

Nevertheless, given that you have mixed data types (continuous numerical and encoded categorical) in your classification problem, a scale-invariant algorithm such as one based on decision trees may be more appropriate. In particular, have a look at XGBoost, whose author explains in this link that you do not actually have to rescale your data.
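As a rough illustration of that scale invariance (on synthetic data, not the Titanic set), an XGBoost classifier can be fit directly on untransformed features with very different scales, because tree splits depend only on the ordering of values within each feature:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Two features on very different scales: one in [0, 500], one in [0, 1].
X = np.column_stack([rng.uniform(0, 500, 200), rng.uniform(0, 1, 200)])
y = (X[:, 0] > 250).astype(int)

# No rescaling step: tree-based splits are invariant to monotonic
# transformations of each feature.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.score(X, y))
```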

In fact, in a recent real use case at my company, we compared rescaling the data versus leaving it unscaled before applying XGBoost, and we obtained better results without rescaling.

Correct answer by German C M on July 6, 2021
