Does the test set has to be in [0,1] range?

Question

I have standardized training set using
mean = XTrain.mean()
XTrain-=mean

std = XTrain.std()
XTrain/=std

And then used mean and std to standardize validation and test sets. The training and validation sets have values that are greater than 1 and less than zero is that okay?

Sammy · Accepted Answer

Standardization centers the values around a mean of $0$ with standard deviation $1$. Therefore, having values smaller than $0$ or greater than $1$ is to be expected. If you want to make sure values are between $0$ and $1$ you need to normalize the data instead.
Here is an example of the two procedures taken from the book "Python Machine Learning" by Raschka:

Be aware though to apply the procedure to your test data with parameters obtained from the training data (in case of standardization: mean and std. dev. of train data).
Sklearn has methods for standardization and normalization which you might want to have a look at.

Dave · Answer

You're measuring how many standard deviations from the mean a given value is. Certainly values can be many standard deviations from the mean. Even for data with a normal distribution, we expect about $5%$ of the observations to be more than $2$ standard deviations from the mean, and we expect $32%$ of the observations to be more than $1$ standard deviation from the mean.
Therefore, it is not at all concerning that you have values more than $1$.
As far as values less than $0$ go, all that means is that you have a value less than the mean. This is common. (While it can happen, consider how to have a data set where no values are less than the mean.)
As Sammy mentioned mere seconds before I posted, be sure to use the mean and standard deviation from your training data when you transform the test and validation data.

Does the test set has to be in [0,1] range?

2 Answers

Add your own answers!

Ask a Question