# When to use Standard Scaler and when Normalizer?

Data Science Asked by Heisenbug on January 6, 2021

I understand what StandardScaler does and what Normalizer does, per the scikit-learn documentation: Normalizer, StandardScaler.

I know when StandardScaler is applied. But in which scenarios is Normalizer applied? Are there scenarios where one is preferred over the other?

• StandardScaler: It transforms the data so that each feature has mean 0 and standard deviation 1. In short, it standardizes the data. Standardization also works for data with negative values. Note that it only centers and rescales each feature; it does not make the data normally distributed. It is often considered more useful in classification than regression. You can read this blog of mine.

• Normalizer: It rescales each sample to unit norm, so feature values end up in $$[-1, 1]$$ (or $$[0, 1]$$ for non-negative data). Due to the reduced range and magnitude, gradients during training are less likely to explode and you avoid very large loss values. It is often considered more useful in regression than classification. You can read this blog of mine.
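As a quick sanity check of the distinction above, here is a minimal sketch using scikit-learn's actual API (the toy matrix is made up for illustration): the two transformers act along different axes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler works per column: each feature ends up with mean 0, std 1.
X_std = StandardScaler().fit_transform(X)

# Normalizer works per row: each sample ends up with unit L2 norm.
X_norm = Normalizer(norm="l2").fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0. 0.] ~[1. 1.]
print(np.linalg.norm(X_norm, axis=1))         # ~[1. 1. 1.]
```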

Answered by Shubham Panchal on January 6, 2021

They are used for two different purposes.

StandardScaler changes each feature column $$f_{:,i}$$ to $$f'_{:,i} = \frac{f_{:,i} - \text{mean}(f_{:,i})}{\text{std}(f_{:,i})}.$$

Normalizer changes each sample $$x_n=(f_{n,1},...,f_{n,d})$$ to $$x'_n = \frac{x_n}{\text{size}(x_n)},$$ where $$\text{size}(x_n)$$ for

1. l1 norm is $$\left| x_n \right|_1=|f_{n,1}|+...+|f_{n,d}|$$,
2. l2 norm is $$\left| x_n \right|_2=\sqrt{f^{2}_{n,1}+...+f^{2}_{n,d}}$$,
3. max norm is $$\left| x_n \right|_\infty=\max\{|f_{n,1}|,...,|f_{n,d}|\}$$.
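These three options correspond to scikit-learn's `norm` parameter. A small sketch verifying the formulas on a single made-up sample:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

x = np.array([[3.0, -4.0]])  # one sample with two features

# l1 size: |3| + |-4| = 7;  l2 size: sqrt(9 + 16) = 5;  max size: max(3, 4) = 4
out_l1 = Normalizer(norm="l1").fit_transform(x)
out_l2 = Normalizer(norm="l2").fit_transform(x)
out_max = Normalizer(norm="max").fit_transform(x)

print(out_l1)   # [[ 3/7 -4/7]]
print(out_l2)   # [[ 0.6 -0.8]]
print(out_max)  # [[ 0.75 -1.  ]]
```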

To illustrate the contrast, consider the data set $$\{1, 2, 3, 4, 5\}$$, which consists of 5 one-dimensional data points (each data point has one feature).
After applying StandardScaler, the data set becomes $$\{-1.41, -0.71, 0., 0.71, 1.41\}$$.
After applying any type of Normalizer, the data set becomes $$\{1., 1., 1., 1., 1.\}$$, since the only feature is divided by itself. So Normalizer has no use for this case. Also, when features have different units, e.g. $$(height, age, income)$$, Normalizer is not used as a pre-processing step; although it might be used as an ad-hoc feature engineering step, similar to what a neuron does in a neural network.
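This example can be reproduced directly (note that StandardScaler uses the population standard deviation, i.e. ddof=0, which is how the $$\pm 1.41$$ values arise):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # 5 samples, 1 feature each

# Standardization: (x - mean) / std = (x - 3) / sqrt(2)
X_std = StandardScaler().fit_transform(X).ravel()
print(X_std.round(2))  # [-1.41 -0.71  0.    0.71  1.41]

# Each sample is a single value divided by its own norm: always 1.
X_norm = Normalizer(norm="l2").fit_transform(X).ravel()
print(X_norm)  # [1. 1. 1. 1. 1.]
```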

As mentioned in this answer, Normalizer is mostly useful for controlling the size of a vector in an iterative process, e.g. a parameter vector during training, to avoid numerical instabilities due to large values.

Answered by Esmailian on January 6, 2021

I don't feel the previous answers address the question fully, so I'll give a fairly comprehensive explanation with two concrete use cases at the end.

Normalizer normalizes rows (sample-wise), not columns (feature-wise). It changes the meaning of the data entirely, because the distributions of the resulting feature values are completely different. Therefore, a scenario where it is useful is when what you consider a feature is the relation between feature values within a sample, rather than across samples.

For example, take dataset:

```
   weight  age
0      45   87
1      40   13
2      56   84
```


After using Normalizer(norm="l2"), it becomes:

```
   weight   age
0    0.46  0.89
1    0.95  0.31
2    0.55  0.83
```


As you can see, the distribution of samples at the feature level has changed in several ways:

1. Before, argsort(weight) gave [1, 0, 2]; it now gives [0, 2, 1]. It doesn't change for age, but that is just by chance: on a bigger dataset it would change with very high probability.
2. The distribution (probability density function) of each feature is modified: before, age[0] was 6.7 times bigger than age[1], but it is now only 2.9 times bigger.
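Both observations can be checked with a few lines (same data as the table above):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

df = pd.DataFrame({"weight": [45, 40, 56], "age": [87, 13, 84]})
ndf = pd.DataFrame(Normalizer(norm="l2").fit_transform(df), columns=df.columns)

# 1. The feature-wise ordering of weight changes after normalization.
print(np.argsort(df["weight"].to_numpy()))   # [1 0 2]
print(np.argsort(ndf["weight"].to_numpy()))  # [0 2 1]

# 2. The ratio between age values shrinks.
print(df["age"][0] / df["age"][1])    # ~6.7
print(ndf["age"][0] / ndf["age"][1])  # ~2.9
```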

Normalizer builds entirely new features that are not correlated with the initial features. Run the Python code provided at the end of this answer to observe the phenomenon.

StandardScaler and other scalers that work feature-wise are preferred when meaningful information is located in the relation between feature values from one sample to another, whereas Normalizer and other scalers that work sample-wise are preferred when meaningful information is located in the relation between feature values from one feature to another.

For example, several studies showed that weight correlates with lifespan in humans and other mammals (after adjusting for sex, height, geographic origin, etc.). As a result, you can see heavy old people as anomalies. One may be interested in why some heavy people live longer and why some thin people live shorter. Then one may want to look at patterns between weight and age. Maybe there exist different groups, each with its own mediator variable from weight to lifespan, etc. As you can see, this amounts to a clustering task on normalized features.

Another example is when you want to cluster documents by topic. In a sense, what defines a topic is the frequency of each word relative to the others in the document. For example, topic 'statistics' may be characterized by a relative frequency of the word 'variance' over the word 'apple' of 12345 (these are random words and frequencies; in real life you would use far more than 2 words). Topic 'verbiage' may be characterized by a high prominence of linking words and adverbs with regard to nouns and verbs. Therefore, if your initial features are the frequency of each word in the document (from a predefined dictionary), you can use Normalizer to get the appropriate relative features that you want. This example is provided by scikit-learn in "Clustering text documents using k-means".

Lastly, in case you ask, Normalizer scales to unit norm for practical numerical reasons (stability, convergence speed, interpretation, etc.), just like StandardScaler.

### Code

Requirements: seaborn==0.11.0

```python
import numpy.random as rd
import pandas as pd
from sklearn.preprocessing import Normalizer
import seaborn as sns

shape = (100, 2)
# Random positive data with a skewed, heavy-tailed distribution.
df = pd.DataFrame(rd.rand(*shape) * rd.lognormal(1, 0.4, shape), columns=["weight", "age"])
ndf = pd.DataFrame(Normalizer(norm="l2").fit_transform(df), columns=["norm_weight", "norm_age"])
sns.kdeplot(data=pd.concat([df, ndf], axis=1))
for d in [df, ndf]:
    sns.pairplot(d.reset_index(), hue="index", diag_kind=None)
```


On the pairplot of the normalized data (third figure), norm_weight as a function of norm_age forms a circular arc. This is because the $$L_2$$ norm places data points on the unit circle: indeed, the features are built such that norm_weight ** 2 + norm_age ** 2 == 1.

Answered by Alexandre Huat on January 6, 2021