Should you use random state or random seed in machine learning models?

Question

I'm starting to study machine learning. In all the examples I have seen, the person who created the ML model used a random state or a random seed to control the randomness of the process. But in real life, when you apply a machine learning model in an actual company project, should you use a random state or seed? Is it right (in data science terms) to set a random state on the model to make the results reproducible?

Claude COULOMBE · Answer

During experimentation, for tuning and reproducibility, you temporarily fix the random state. To check that your results do not depend on one particular seed, you then repeat the experiment with several different random states and take the mean of the results.
# Choose a random state value
RANDOM_STATE = 42

# Seed Python's built-in random module
import random
random.seed(RANDOM_STATE)

# Seed NumPy's global random generator
import numpy as np
np.random.seed(RANDOM_STATE)

# Seed other libraries that have their own generator, such as TensorFlow
import tensorflow as tf
tf.random.set_seed(RANDOM_STATE)  # TensorFlow 2.x; in TensorFlow 1.x this was tf.set_random_seed

# Finally, don't forget to pass random_state to any function or estimator that accepts it, e.g.
from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearchCV(estimator, param_distributions, random_state=RANDOM_STATE)
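
The "repeat with different random states and take the mean" step could look roughly like the sketch below. It is only an illustration: the synthetic dataset, the choice of a random forest, and the list `SEEDS` are assumptions made for the example, not part of the answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative (generated with its own fixed seed).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

SEEDS = [0, 1, 2, 3, 4]  # hypothetical choice of random states to try
scores = []
for seed in SEEDS:
    # The same seed drives both the split and the model for this run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Report the mean and the spread across seeds rather than a single run.
mean_score = float(np.mean(scores))
std_score = float(np.std(scores))
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f} over {len(SEEDS)} seeds")
```

The spread across seeds gives a rough idea of how much of a score difference between two configurations could be due to randomness alone.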

For a production system, you remove the fixed random state by setting it to None:
# Remove the fixed random state
RANDOM_STATE = None
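
To see what a value of None does in practice, here is a small sketch using only Python's built-in random module. The helper `draw` is a hypothetical name introduced for this illustration.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random floats after seeding Python's global generator."""
    # seed=None makes Python re-seed from OS entropy (or the current time).
    random.seed(seed)
    return [random.random() for _ in range(n)]

fixed_a = draw(42)
fixed_b = draw(42)
free_a = draw(None)
free_b = draw(None)

print(fixed_a == fixed_b)  # a fixed seed repeats the same sequence
print(free_a == free_b)    # with None, the sequences almost surely differ
```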

Peter · Answer

Setting a seed or fixing a random state controls randomness. When you want to do "controlled experiments", you need to control randomness to some extent to achieve reproducible (and therefore comparable) results.
You should have a good idea of where it is necessary to control randomness. For example, when you use linear regression or logistic regression with a deterministic solver, the results will always be the same (provided you use the same data and model specification). However, when you randomly split a data set into training and test sets, randomness will affect your train/test split.
Now say you want to compare different model specifications of a linear regression to see which model is best, and you use a train/test split. In order to compare the specifications fairly, you should use the same data for training and testing in each case, so you need to set a seed for the train/test split. Otherwise, if you don't set a seed, differences in the results can originate from two sources: A) the changed model specification and B) the changed train/test split.
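
A minimal sketch of that comparison, assuming scikit-learn and synthetic data. The names `SPLIT_SEED` and `specs`, and the three feature subsets, are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends on features 0 and 1, feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Split once, with a fixed seed, so every specification sees the same rows.
SPLIT_SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SPLIT_SEED
)

# Hypothetical model specifications: different subsets of the columns.
specs = {"x0 only": [0], "x0 + x1": [0, 1], "all features": [0, 1, 2]}
r2 = {}
for name, cols in specs.items():
    model = LinearRegression().fit(X_train[:, cols], y_train)
    r2[name] = model.score(X_test[:, cols], y_test)  # R^2 on the shared test set

for name, value in r2.items():
    print(f"{name}: R^2 = {value:.3f}")
```

Because the split is identical for all three fits, any difference in R^2 comes from the specification (source A), not from the split (source B).
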
There are also a number of models that are affected by randomness in the learning process itself. Neural nets or boosted models, for instance, will produce somewhat different results after each run if you don't set a seed. In this case too, e.g. when you do hyperparameter tuning to find the best model, controlling randomness is useful to ensure that results from different runs are comparable.
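
As a rough illustration with a boosted model, the sketch below fits scikit-learn's GradientBoostingRegressor with row subsampling (`subsample=0.5`), which is one setting where the fit depends on random draws. The data and the helper `fit_predict` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data, purely illustrative.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def fit_predict(seed):
    """Fit a stochastic gradient boosting model and return its predictions."""
    # subsample < 1.0 means each tree is fit on a random subset of the rows.
    model = GradientBoostingRegressor(
        n_estimators=30, subsample=0.5, random_state=seed
    )
    return model.fit(X, y).predict(X)

same_seed_equal = np.allclose(fit_predict(1), fit_predict(1))
diff_seed_equal = np.allclose(fit_predict(1), fit_predict(2))

print(same_seed_equal)  # same seed: the two runs agree
print(diff_seed_equal)  # different seeds: the runs differ
```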

Donald S · Answer

Your intuition is correct. You can set the random_state or seed for a few reasons:

For repeatability, if you want to publish your results or share them with other colleagues.
If you are tuning the model, in an experiment you usually want to keep all variables constant except the one(s) you are tuning.

While tuning or developing, I usually set the random_state parameter rather than a global random seed, as this is a more direct approach. When you go to production, you should remove the random_state and/or random seed settings (or set them to None) and then do some cross-validation. This will give you a more realistic picture of your model's performance.
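
The difference between the random_state parameter and a global seed can be sketched as follows, using scikit-learn's train_test_split. The toy array `data` is just an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data, purely illustrative

# Option 1: pass random_state directly. The global NumPy seed is irrelevant.
np.random.seed(1)
a_train, _ = train_test_split(data, random_state=0)
np.random.seed(2)
b_train, _ = train_test_split(data, random_state=0)

# Option 2: leave random_state=None. scikit-learn then falls back to NumPy's
# global generator, so the global seed decides the split.
np.random.seed(1)
c_train, _ = train_test_split(data)
np.random.seed(1)
d_train, _ = train_test_split(data)

print(np.array_equal(a_train, b_train))  # same random_state, same split
print(np.array_equal(c_train, d_train))  # same global seed, same split
```

Passing random_state keeps the control local to the call, whereas a global seed can be disturbed by any other code that draws from the same generator in between.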
