Should you use random state or random seed in machine learning models?

Question

I'm starting to study machine learning. In all the examples I have seen, the person who created the ML model used a random state or a random seed to control the randomness of the process. But in real life, when you're applying a machine learning model in an actual company project, should you use a random state or seed? Is it right (in data science terms) to set a random state for the machine learning model to make the results reproducible?

Donald S · Accepted Answer

Your intuition is correct. You can set the random_state or seed for a few reasons:

- For repeatability, if you want to publish your results or share them with colleagues.
- For tuning: in an experiment you usually want to keep all variables constant except the one(s) you are tuning.

While tuning or developing, I usually set the random_state parameter rather than the random seed, as this is the more direct approach. When you go to production, you should remove the random_state and/or random_seed settings (or set them to None), then do some cross-validation. This will give you a more realistic picture of how your model performs.
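
To make the difference between the two approaches concrete, here is a minimal sketch. It assumes scikit-learn and NumPy, which the answer does not name explicitly; the toy arrays and the seed value 42 are purely illustrative. It contrasts passing random_state to a single call with seeding NumPy's global generator, which scikit-learn falls back to when random_state is None.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Option 1: pass random_state directly to the call that needs it.
_, test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
same_with_random_state = np.array_equal(test_a, test_b)

# Option 2: seed NumPy's global generator and leave random_state at None.
np.random.seed(42)
_, test_c, _, _ = train_test_split(X, y, test_size=0.3)
np.random.seed(42)
_, test_d, _, _ = train_test_split(X, y, test_size=0.3)
same_with_global_seed = np.array_equal(test_c, test_d)

print("random_state reproducible:", same_with_random_state)
print("global seed reproducible:", same_with_global_seed)
```

Both options reproduce the split, but random_state only affects the call it is passed to, while the global seed affects everything that later draws from NumPy's global generator.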

Peter · Answer

Setting a seed or fixing a random state controls randomness. When you want to do "controlled experiments", you need to control randomness to some extent to achieve reproducible (and therefore comparable) results.
You should have a good idea of where it is necessary to control randomness. For example, when you use linear regression or logistic regression, the results will always be the same (provided you use the same data and model specification). However, when you randomly split a data set into training and test sets, randomness will affect your test/train split.
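
The contrast above can be sketched as follows. This is a minimal illustration assuming scikit-learn and NumPy; the synthetic data and its coefficients are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): y is linear in two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ordinary least squares has no random component:
# two fits on the same data give the same coefficients.
coef_first = LinearRegression().fit(X, y).coef_
coef_second = LinearRegression().fit(X, y).coef_
fits_agree = np.allclose(coef_first, coef_second)

# The split does have one: with no random_state, each call shuffles anew,
# so the two test sets will almost certainly differ.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3)
splits_agree = np.array_equal(X_test_a, X_test_b)

print("fits agree:", fits_agree)
print("splits agree:", splits_agree)
```
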
Now say you want to compare different model specifications of a linear regression to see which model is best, and you use a test/train set. In order to compare different linear model specifications, you should use the same data for training/testing. So in this case, you would need to set a seed in the test/train split. Otherwise, if you don't set a seed, differences in the results can originate from two sources: (a) the changed model specification and (b) the changed test/train split.
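
A minimal sketch of such a comparison, assuming scikit-learn and NumPy. The synthetic data, the helper name holdout_r2, and the two feature sets are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): both features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# One seeded split, reused for every model specification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def holdout_r2(columns):
    """Fit a linear model on the given feature columns; return test-set R^2."""
    model = LinearRegression().fit(X_train[:, columns], y_train)
    return model.score(X_test[:, columns], y_test)

# Two specifications evaluated on exactly the same rows.
score_one_feature = holdout_r2([0])
score_two_features = holdout_r2([0, 1])

print("one feature R^2:", round(score_one_feature, 3))
print("two features R^2:", round(score_two_features, 3))
```

Because both specifications see the same training and test rows, any difference between the two scores comes from the specification alone.
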
There are also a number of models that are affected by randomness during learning. Neural nets or boosted models, for instance, will produce somewhat different results on each run if you don't set a seed. In this case too, e.g. when you do hyperparameter tuning to find the best models, controlling randomness is useful to ensure that results from different model runs are comparable.
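
As a rough illustration of randomness inside the learning process itself, the sketch below uses scikit-learn's GradientBoostingRegressor with row subsampling. This is an assumed setup, not something the answer specifies; the data, the helper name fit_and_predict, and the hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

def fit_and_predict(seed):
    """Fit a boosted model with the given seed and predict on the training data."""
    # subsample < 1.0 makes each tree train on a random subset of rows,
    # so the fitted model depends on the random state.
    model = GradientBoostingRegressor(
        n_estimators=50, subsample=0.8, random_state=seed
    )
    return model.fit(X, y).predict(X)

# Same seed: the two runs produce the same predictions.
same_seed_agrees = np.allclose(fit_and_predict(1), fit_and_predict(1))

# Different seeds: predictions typically differ slightly.
different_seed_agrees = np.allclose(fit_and_predict(1), fit_and_predict(2))

print("same seed agrees:", same_seed_agrees)
print("different seeds agree:", different_seed_agrees)
```

Fixing the seed during tuning means a change in score can be attributed to the hyperparameter that was changed rather than to a different random draw.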
