Standardization on training and split data

Data Science Asked by nishanth reddy on December 24, 2020

I am confused about which of the following approaches should be used for standardization:

  • Method 1: call fit_transform on the training data and only transform on the test data

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)  # learn mean/std from the training data
    X_test = sc.transform(X_test)        # reuse the training statistics
    
  • Method 2: call fit_transform on both the training data and the test data

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    # scaler_train = sc.fit(X_train)
    # X_train_sd = scaler_train.transform(X_train)
    X_test = sc.fit_transform(X_test)  # refits the scaler on the test data
    # scaler_test = sc.fit(X_test)
    # X_test_sd = scaler_train.transform(X_test)
    

This is a follow-up question to:
StandardScaler before and after splitting data

One Answer

You should fit your scaler only on the training data, which is method 1 in your question. The scaler is part of your model: fitting it estimates a mean and standard deviation for each feature, so fitting it to some data counts as learning from that data.
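To make the "scaler is part of your model" point concrete, here is a minimal sketch of what fitting actually stores. The toy arrays are made up for illustration; `mean_` and `scale_` are the attributes scikit-learn's `StandardScaler` exposes after `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

sc = StandardScaler()
sc.fit(X_train)

# fit() stores the per-feature mean and standard deviation of the data it saw.
train_mean = sc.mean_.copy()
train_scale = sc.scale_.copy()

# transform() applies those stored training statistics to any new data.
X_test_scaled = sc.transform(X_test)

# The same result computed by hand from the training statistics.
expected = (X_test - train_mean) / train_scale
print(np.allclose(X_test_scaled, expected))
```

In other words, the learned parameters here come entirely from `X_train`; the test rows are only passed through them.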

Test data is used to evaluate your model on previously unseen data, so if you fit your scaler to test data, it is not "unseen" data anymore.
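As a rough illustration of the difference between the two methods, the sketch below scales the same test set both ways. The data is synthetic, and the shift of 5 between the training and test distributions is an arbitrary assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
X_test = rng.normal(loc=5.0, scale=1.0, size=(50, 1))  # deliberately shifted

# Method 1: statistics come from the training data only.
sc = StandardScaler()
sc.fit(X_train)
test_m1 = sc.transform(X_test)

# Method 2: the scaler is refit on the test data itself.
test_m2 = StandardScaler().fit_transform(X_test)

m1_mean = test_m1.mean()
m2_mean = test_m2.mean()
print("method 1 test mean:", m1_mean)
print("method 2 test mean:", m2_mean)
```

With method 2 the scaled test set is centred at zero by construction, so the model sees test inputs on a different scale than the one it was trained on, and the shift is hidden. With method 1 the test data stays expressed in training units, which is what the model will face on genuinely new data.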

Answered by Adam Oudad on December 24, 2020
