Standardization on training and split data

Question

I am confused about which of the following approaches should be used for standardization:

method 1: fit-transforming the training data and only transforming the test data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

method 2: fit-transforming both the training and the test data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# scaler_train = sc.fit(X_train)
# X_train_sd = scaler_train.transform(X_train)
X_test = sc.fit_transform(X_test)
# scaler_test = sc.fit(X_test)
# X_test_sd = scaler_train.transform(X_test)

This is a follow-up question to:
StandardScaler before and after splitting data

Adam Oudad · Answer

You should fit your scaler on the training data only. The scaler is part of your model, and fitting it to some data means learning from that data (here, each feature's mean and standard deviation).
Test data is used to evaluate your model on previously unseen data, so if you fit your scaler to the test data, it is no longer "unseen". In other words, use method 1: call fit_transform on the training set and only transform on the test set.
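
To make the point concrete, here is a minimal sketch. It assumes synthetic data generated with NumPy and a 75/25 split. Those choices, and the variable names X_train_sd and X_test_sd, are illustrative and not taken from the question. It shows that the statistics stored in the scaler come from the training rows alone and are simply reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (an assumption, not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_sd = sc.fit_transform(X_train)  # learns mean_ and scale_ from the training rows only
X_test_sd = sc.transform(X_test)        # reuses the training statistics; nothing is re-fitted

# The stored statistics are exactly those of the training set.
print(np.allclose(sc.mean_, X_train.mean(axis=0)))
print(np.allclose(sc.scale_, X_train.std(axis=0)))

# Training columns are centred at zero by construction; test columns are
# only approximately centred, because their statistics were never seen.
print(X_train_sd.mean(axis=0))
print(X_test_sd.mean(axis=0))
```

If you also cross-validate, wrapping the scaler and the estimator in a sklearn.pipeline.Pipeline is one way to get the same fit-on-train-only behaviour automatically within each fold.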
