Data Science Asked on November 17, 2021
I’m starting to study machine learning. In all the examples I have seen, the person who created the ML model used a random state or a random seed to control the randomness of the process. But in real life, when you’re applying a machine learning model in an actual company project, should you use a random state or seed? Is it right (in data science terms) to set a random state on the machine learning model to make the results reproducible?
During experimentation, for tuning and reproducibility, you temporarily fix the random state. You then repeat the experiment with different random states and take the mean of the results.
# Choose a fixed random state value
RANDOM_STATE = 42
# Seed Python's built-in random module
import random
random.seed(RANDOM_STATE)
# Seed NumPy's global random generator
import numpy as np
np.random.seed(RANDOM_STATE)
# Seed other libraries such as TensorFlow
# (TensorFlow 2.x API; TensorFlow 1.x used tf.set_random_seed)
import tensorflow as tf
tf.random.set_seed(RANDOM_STATE)
# Finally, don't forget to set the random_state parameter in functions such as
RandomizedSearchCV(random_state=RANDOM_STATE, ...)
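As a minimal sketch of the "repeat with different random states and take the mean" idea, assuming scikit-learn is available: the synthetic dataset, the random forest, the seed list, and the `run_experiment` helper are all illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def run_experiment(seed):
    """Train and score one model with the split and the model both tied to `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Each single run is reproducible; the spread across seeds shows how much
# of the score is due to randomness rather than to the model itself.
seeds = [0, 1, 2, 3, 4]
scores = np.array([run_experiment(seed) for seed in seeds])
print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how sensitive the result is to the choice of seed.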
For a production system, you remove the fixed random state by setting it to None
# No fixed random state in production
RANDOM_STATE = None
Answered by Claude COULOMBE on November 17, 2021
Setting a seed or fixing a random state controls randomness. When you want to do "controlled experiments", you need to control randomness to some extent to achieve reproducible (and therefore comparable) results.
You should have a good idea of where it is necessary to control randomness. For example, when you use linear regression or logistic regression with a deterministic solver, the results will always be the same (provided you use the same data and model specification). However, when you randomly split a data set into training and test sets, randomness will affect your test/train split.
Now say you want to compare different model specifications of a linear regression to see which model is best, and you use a test/train split. In order to compare different linear model specifications, you should use the same data for training and testing. So in this case, you would need to set a seed for the test/train split. Otherwise, if you don't set a seed, changes in the results can originate from two sources: (A) the changed model specification and (B) the changed test/train split.
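A small sketch of that comparison, assuming scikit-learn and synthetic data: the data-generating process, the `SPLIT_SEED` name, and the three column subsets are made up for illustration. The split is seeded once and reused, so the only thing that varies between runs is the model specification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

SPLIT_SEED = 42  # arbitrary; any fixed integer works

# Synthetic data (an assumption): y depends on x0 and x1, x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# One seeded split, shared by every specification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SPLIT_SEED
)

# Each specification is a subset of feature columns.
specifications = {
    "x0 only": [0],
    "x0 + x1": [0, 1],
    "x0 + x1 + x2": [0, 1, 2],
}

r2_by_spec = {}
for name, columns in specifications.items():
    model = LinearRegression().fit(X_train[:, columns], y_train)
    r2_by_spec[name] = model.score(X_test[:, columns], y_test)
    print(f"{name}: test R^2 = {r2_by_spec[name]:.3f}")
```

Because every specification is scored on the same held-out rows, differences in R^2 can be attributed to the specification alone.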
There are also a number of models that are affected by randomness during learning. Neural nets or boosted models, for instance, will produce somewhat different results on each run if you don't set a seed. In this case too, for example when you do hyperparameter tuning to find the best model, controlling randomness is useful to ensure that results from different runs are comparable.
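To illustrate randomness inside the learning process itself, here is a hedged sketch using scikit-learn's `GradientBoostingRegressor` with row subsampling. The synthetic data, the hyperparameters, and the `fit_predict` helper are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only for illustration (an assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)


def fit_predict(seed):
    # subsample < 1.0 makes each boosting stage train on a random subset
    # of the rows, so the fitted model depends on the random state.
    model = GradientBoostingRegressor(
        n_estimators=50, subsample=0.5, random_state=seed
    )
    return model.fit(X, y).predict(X)


same_a = fit_predict(1)
same_b = fit_predict(1)
other = fit_predict(2)

print("same seed, same predictions:", np.allclose(same_a, same_b))
print("different seed, same predictions:", np.allclose(same_a, other))
```

During hyperparameter tuning, fixing the seed in this way means that a change in score comes from the changed hyperparameters rather than from a different random subsample.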
Answered by Peter on November 17, 2021
Your intuition is correct. You can set the random_state or seed for a few reasons:
I usually set the random_state variable, not the random seed, while tuning or developing, as this is a more direct approach. When you go to production, you should remove the random_state and/or random_seed settings, or set them to None, then do some cross validation. This will give you more realistic results from your model.
Answered by Donald S on November 17, 2021