
When to normalize or regularize features in Data Science

Data Science Asked by taga on December 28, 2020

How can I know when I need to use normalization of features and when regularization?
I know that when there is a big difference between the min and max values in some feature/column I need to scale the data, and also that if I use SVM, KNN, or clustering I need to scale the data.

But how can I know when to normalize and when to regularize the features?
How can I know that if I have, for example, 200 features and 10,000 rows?
Is there any way that I can plot my data and see what I can use?

4 Answers

1) Normalization makes training less sensitive to the scale of the features.

2) Regularization - use it when your model overfits the training set and generalizes very poorly to unseen data (the validation or test set).

3) 200 columns and 10,000 rows? Try to load the dataset into a database, or read it in chunks with pandas:

import pandas as pd

# read the CSV in chunks of 1,000 rows instead of loading it all into memory at once
reader = pd.read_csv('my_file.csv', chunksize=1000)
for chunk in reader:
    # each chunk is a regular DataFrame; process it here
    print(chunk.shape)

Answered by fuwiak on December 28, 2020

Step one would be understanding how the algorithms you are using work. Certain algorithms (usually distance-based ones) work better when the data is scaled, while others (like random forest) don't care. Knowing how an algorithm works will go a long way towards understanding when (and why) you should scale your data.
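For instance, a quick way to see this effect is to compare cross-validated accuracy with and without scaling. Below is a minimal sketch assuming scikit-learn and its built-in wine dataset (the dataset and the exact scores are only illustrative):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the wine features have very different ranges (e.g. proline vs. hue)
X, y = load_wine(return_X_y=True)

models = {
    "KNN (raw)": KNeighborsClassifier(),
    "KNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random forest (raw)": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())

# typically the scaled KNN improves noticeably, while the tree-based
# random forest is largely indifferent to the feature scales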

Answered by astel on December 28, 2020

Just one comment on regularization, since you mention it in the same sentence as normalization.

Normalization

In my opinion, normalization is something that should be considered very early in model evaluation, and as stated by the other posters here, how big its influence is depends heavily on the ML model you use (and of course on your data). For decision-tree-based models it probably has little or no influence, while for KNN it most likely has a very big influence.

Regularization

Regularization, on the contrary, is something I would do as one of the very last steps. I would try to optimize all other parameters first (roughly), then do regularization, and after that, if required, see whether I can still fine-tune some other parameters. In my opinion regularization should not be one of the first steps, also because in my experience the regularization value has a strong interdependence with almost every other parameter. That is not the case for all parameters: e.g. if you use lightgbm, the parameter for the maximum number of leaves the model can generate seems to be pretty stable. Once you have optimized it for your ML problem, you can almost certainly leave it unchanged, even if you change other parameters. So if you can identify such "stable" parameters for your model, it is a good idea to optimize them first.

On the other hand, if regularization influences almost every other parameter, you should probably do it last: see how far you can get without regularization, and then see whether you can improve that further by applying regularization.
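To make that order of steps concrete, here is a minimal sketch assuming lightgbm and scikit-learn's GridSearchCV on synthetic data (the parameter grids and values are only illustrative): first roughly tune a stable parameter such as num_leaves, and only then sweep the regularization strength with everything else fixed.

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# step 1: roughly optimize a "stable" parameter (num_leaves) without regularization
leaves_search = GridSearchCV(lgb.LGBMClassifier(random_state=0),
                             {"num_leaves": [15, 31, 63, 127]}, cv=5)
leaves_search.fit(X, y)
best_leaves = leaves_search.best_params_["num_leaves"]

# step 2: keep num_leaves fixed and tune the regularization strength last
reg_search = GridSearchCV(lgb.LGBMClassifier(num_leaves=best_leaves, random_state=0),
                          {"reg_lambda": [0.0, 0.1, 1.0, 10.0]}, cv=5)
reg_search.fit(X, y)
print(best_leaves, reg_search.best_params_)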

And of course, there are also clear signs of when regularization is required. You just have to look at the validation and training errors and how both progress. Roughly, if your training error decreases while your validation error increases after a change, that change increased the overfitting of your model. So if you discover a large gap between training and validation error, you should check whether you can reduce overfitting, and regularization is just one possibility for that.
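As a rough sketch of that check (again with lightgbm on synthetic data; the split, the model, and the reg_lambda values are only illustrative), compare training and validation errors with and without a regularization penalty:

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for reg_lambda in [0.0, 10.0]:  # no L2 penalty vs. some L2 penalty
    model = lgb.LGBMRegressor(reg_lambda=reg_lambda, random_state=0)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    # a large gap between train_err and val_err is the sign of overfitting
    print(f"reg_lambda={reg_lambda}: train MSE={train_err:.1f}, val MSE={val_err:.1f}")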

Answered by jottbe on December 28, 2020

Normalization rescales features to [0, 1]. The goal of normalization is to change the values of numeric columns in the dataset to a common scale without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is required only when features have different ranges. For example, consider a dataset containing two features, age and income, where age ranges from 0–100 while income ranges from 0–100,000 and higher. Income is about 1,000 times larger than age, so these two features are on very different scales. When we do further analysis, like multivariate linear regression, the attribute income will intrinsically influence the result more due to its larger values, but this doesn't necessarily mean it is more important as a predictor. So we normalize the data to bring all the variables to the same range.

Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). It is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
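A minimal sketch of that rescaling, assuming scikit-learn's MinMaxScaler and some made-up age/income values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical age/income pairs on very different scales
X = np.array([[25, 30_000],
              [40, 120_000],
              [60, 55_000]], dtype=float)

X_norm = MinMaxScaler().fit_transform(X)  # rescales each column to [0, 1]
print(X_norm)
# after scaling, both columns lie in [0, 1], so income no longer dominates
# distance- or gradient-based methods purely because of its larger units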

Regularization is a technique intended to solve the problem of overfitting. By adding an extra term to the loss function, the parameters of the learning algorithm are more likely to converge to smaller values, which can significantly reduce overfitting. Regularization significantly reduces the variance of the model without substantially increasing its bias. One of the major goals during the training of your machine learning model is to avoid overfitting: the model will have low accuracy on new data if it is overfitting the training dataset. This happens because your model is trying too hard to capture the noise in your training dataset, where by noise we mean the data points that don't really represent the true properties of your data, but rather randomness. Learning such data points makes your model more flexible, but it does so at the risk of overfitting the model.
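A minimal sketch of that idea with L2 (ridge) regularization in scikit-learn; the data and the alpha value (the strength of the penalty added to the loss) are only illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# noisy data with only a few truly informative features invites overfitting
X, y = make_regression(n_samples=50, n_features=30, n_informative=5,
                       noise=20.0, random_state=0)

unregularized = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)  # loss = squared error + alpha * ||w||^2

# the penalty pulls the coefficients towards smaller values
print("mean |coef| without regularization:", np.abs(unregularized.coef_).mean())
print("mean |coef| with ridge:", np.abs(regularized.coef_).mean())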

Answered by ASH on December 28, 2020
