Data Science - Asked on April 3, 2021
I need some clarification about scaling data:

1. To prevent data leakage, we split the train and test sets and then perform the scaling on them separately, correct?
2. When scaling or label encoding the data in the train and test datasets, what is the industry standard for this kind of feature transformation?
You should standardise the training set and apply the same standardisation (i.e., the training mean and standard deviation) to the test set. Here is one option to do this in Python:
# Data
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
# Compute the standardisation statistics from the training data only
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
# Standardise both sets with the training statistics
train_data -= mean
train_data /= std
test_data -= mean
test_data /= std
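As a quick sanity check (an added illustration, not part of the original answer), the training features should now have mean zero and unit standard deviation, while the test features will only be approximately standardised, since they were scaled with the training statistics:

import numpy as np

# Training features are standardised exactly; test features only roughly,
# because both were scaled with statistics taken from the training set.
print(np.allclose(train_data.mean(axis=0), 0))  # True
print(np.allclose(train_data.std(axis=0), 1))   # True
print(test_data.mean(axis=0).round(2))          # close to, but not exactly, 0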
Answered by Peter on April 3, 2021
For points 1 and 2, yes. And this is how it should be done with scaling: fit a scaler on the training set, then apply that same fitted scaler to both the training set and the test set.
Using sklearn:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training set and transform it
X_test = scaler.transform(X_test)        # reuse the fitted scaler on the test set
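In practice, a common pattern (an illustrative sketch, not part of the original answer) is to wrap the scaler and the model in an sklearn Pipeline, so the scaler is always fitted on the training data only, including inside cross-validation:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline fits the scaler on whatever data .fit() receives,
# so the test set never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)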
Regarding binarizing, you should not run into this problem unless you want to choose the threshold dynamically from the data. If so, you should use only the training dataset to choose it, as in the sketch below.
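For instance, here is a minimal sketch, assuming the per-feature thresholds are chosen as the training medians (a hypothetical choice for illustration):

import numpy as np

# Hypothetical choice: per-feature thresholds taken from the training set only
thresholds = np.median(X_train, axis=0)

# Apply the same training-derived thresholds to both sets
X_train_bin = (X_train > thresholds).astype(int)
X_test_bin = (X_test > thresholds).astype(int)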
Answered by etiennedm on April 3, 2021