TransWikia.com

Should you use random state or random seed in machine learning models?

Data Science Asked on November 17, 2021

I’m starting to study machine learning. All the examples I saw, the person that created the ML model used a random state or a random seed to stop the randomness of the process. But, in real life, when you’re trying to apply a machine learning model into an actual project of a company, should you use any random state or seed? Is it right (in data science terms) to set a random state to the machine learning model and make reproducible results?

3 Answers

During an experiment, for tuning and reproducibility, you temporarily fix the random state, but you should also repeat the experiment with different random states and take the mean of the results:

# Set a Random State value
RANDOM_STATE = 42

# Seed Python's built-in random module
import random
random.seed(RANDOM_STATE)

# Seed NumPy's global random generator
import numpy as np
np.random.seed(RANDOM_STATE)

# Seed other libraries, like TensorFlow, in the same way
import tensorflow as tf
tf.random.set_seed(RANDOM_STATE)  # in TensorFlow 1.x: tf.set_random_seed(RANDOM_STATE)

# Finally, don't forget to set the random_state parameter in functions like
RandomizedSearchCV(random_state=RANDOM_STATE, ...)

For a production system, remove the fixed random state by setting it to None:

# Set a Random State value
RANDOM_STATE = None
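The repeat-and-average idea from the first paragraph can be sketched as follows. This is a minimal illustration, not the answerer's exact procedure: it assumes a scikit-learn workflow and uses a toy dataset from `make_classification`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data just for illustration
X, y = make_classification(n_samples=300, random_state=0)

# Repeat the experiment with several random states and average the scores,
# so the reported result does not hinge on one lucky (or unlucky) split
scores = []
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(np.mean(scores))  # mean accuracy across seeds
```

Reporting the mean (and, ideally, the spread) across seeds gives a more honest estimate than a single fixed split.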

Answered by Claude COULOMBE on November 17, 2021

Setting a seed or fixing a random state controls randomness. When you want to do "controlled experiments", you need to control randomness to some extent to achieve reproducible (and thereby also comparable) results.

You should have a good idea where it is necessary to control randomness: E.g. when you use linear regression or logistic regression, the results will always be the same (provided you use the same data and model specification). However, when you randomly split a data set for test and training, randomness will affect your test/train split.

Now say you want to compare different model specifications of a linear regression to see which is the best model, and you use a test/train split. In order to compare different linear model specifications, you should use the same data for training/testing. So in this case, you would need to set a seed for the test/train split. Otherwise - if you don't set a seed - differences in the results can originate from two sources: (A) the changed model specification and (B) the changed test/train split.
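The point about fixing the split can be shown concretely. This sketch assumes scikit-learn's `train_test_split`; with the same `random_state`, two calls produce identical splits, so any score difference between models is due to the models alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> same split every time
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

assert (X_te1 == X_te2).all() and (y_te1 == y_te2).all()
```

Any two model specifications trained on `X_tr1` and scored on `X_te1` are then compared on exactly the same data.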

There are also a number of models which are affected by randomness in the process of learning. Neural nets or boosted models - for instance - will produce somewhat different results on each run if you don't set a seed. In this case too, e.g. when you do hyperparameter tuning to find the best model, controlling randomness is useful to ensure that results from different runs are comparable.
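For models with randomness in the learning process itself, fixing `random_state` makes repeated fits identical. A minimal sketch, assuming scikit-learn's `RandomForestClassifier` (whose bootstrap sampling and feature selection are random) on a toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Same random_state -> bit-for-bit identical forests
m1 = RandomForestClassifier(n_estimators=10, random_state=7).fit(X, y)
m2 = RandomForestClassifier(n_estimators=10, random_state=7).fit(X, y)
assert (m1.predict(X) == m2.predict(X)).all()
```

With `random_state=None`, the two fits would generally differ slightly, which is exactly the run-to-run noise the answer describes.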

Answered by Peter on November 17, 2021

Your intuition is correct. You can set the random_state or seed for a few reasons:

  1. For repeatability, if you want to publish your results or share them with other colleagues
  2. If you are tuning the model, in an experiment you usually want to keep all variables constant except the one(s) you are tuning.

I usually set the random_state variable rather than the random seed while tuning or developing, as this is a more direct approach. When you go to production, you should remove the random_state and/or random seed settings (or set them to None), then do some cross-validation. This will give you more realistic results from your model.
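The cross-validation step above can be sketched as follows. This assumes scikit-learn; with `shuffle=True` and `random_state=None`, `KFold` produces a different shuffle on each run, so the scores reflect run-to-run variability rather than one fixed partition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# random_state=None: the folds are shuffled differently on every run
cv = KFold(n_splits=5, shuffle=True, random_state=None)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores.mean(), scores.std())
```

The mean and standard deviation across folds give a more realistic picture of how the model will behave on unseen data.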

Answered by Donald S on November 17, 2021

