
Data scaling for training and test sets

Data Science Asked on April 3, 2021

I need some clarification on scaling data. To prevent data leakage, we split the data into train and test sets and then perform the scaling on them separately, correct?

So, when scaling or label encoding the data in the train and test datasets:

  1. How do we ensure that the scaling of the test set is consistent with the train set? Calling fit_transform separately on the train and test sets scales the features differently. So do we fit on the train set first and then transform both the train and test sets?
  2. Can we save the scaler so that new data is scaled exactly as it was during training?
  3. Does the same apply to label encoders and binarizers?

In short, what is the industry standard for feature transformation?

2 Answers

You should standardise the train set and apply the same standardisation to the test set. Here is one option to do this in Python:

# Load the Boston housing data
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# Compute standardisation statistics on the training data only
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

# Apply the training statistics to both the train and the test set
train_data -= mean
train_data /= std
test_data -= mean
test_data /= std
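A quick sanity check can confirm the transform: after standardising with the training statistics, each training feature should have mean ~0 and standard deviation ~1 (a minimal sketch, assuming numpy is available; the test set will only be approximately standardised, since it was scaled with the training statistics):

import numpy as np

# Each training feature should now have mean ~0 and std ~1
assert np.allclose(train_data.mean(axis=0), 0, atol=1e-6)
assert np.allclose(train_data.std(axis=0), 1, atol=1e-6)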

Answered by Peter on April 3, 2021

For points 1 and 2, yes, and this is how scaling should be done: fit a scaler on the training set, then apply that same scaler to both the training and the test set.

Using sklearn:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training set and transform it
X_test = scaler.transform(X_test)        # reuse the fitted scaler on the test set
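For point 2, a common pattern is to persist the fitted scaler and reload it when new data arrives. A minimal sketch using joblib, assuming the fitted scaler from above and a hypothetical file name scaler.joblib:

import joblib

# Persist the scaler fitted on the training data
joblib.dump(scaler, "scaler.joblib")

# Later, e.g. at inference time: reload it and scale new data
# with the exact statistics learned during training
scaler = joblib.load("scaler.joblib")
X_new_scaled = scaler.transform(X_new)  # X_new: hypothetical new, unseen data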

Regarding binarizing, you should not have this problem, unless you want to choose the threshold dynamically based on the data. If so, you should use only the training set to choose it.
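The same fit-on-train, transform-on-both pattern applies to label encoders (point 3). A minimal sketch with sklearn's LabelEncoder, where y_train and y_test are assumed label arrays:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)  # learn the label mapping on train
y_test_enc = encoder.transform(y_test)        # reuse the same mapping on test
# Note: transform raises a ValueError if the test set contains
# labels never seen during training.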

Answered by etiennedm on April 3, 2021
