Data Science Asked by New Developer on December 3, 2020
Hyperparameter search is computationally expensive. I am wondering whether one can tune the hyperparameters independently: tune one hyperparameter while the others are held fixed. For example, say we have two hyperparameters, A and B. We search for the best value of A with B fixed at a random value, then we search for the best value of B with A fixed at its best value.
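For concreteness, here is a minimal sketch of the scheme I have in mind; the `train_and_validate` helper and the candidate values are hypothetical stand-ins for an actual training run.

```python
import random

def train_and_validate(A, B):
    # Placeholder surrogate standing in for training a model with
    # hyperparameters A and B and returning its validation loss;
    # in practice this would fit the network and evaluate it.
    return (A - 64) ** 2 * 1e-4 + (B - 0.2) ** 2

A_grid = [16, 32, 64, 128]   # e.g. number of units (hypothetical candidates)
B_grid = [0.0, 0.2, 0.4]     # e.g. dropout rate (hypothetical candidates)

# Step 1: tune A with B fixed at a random value.
B_fixed = random.choice(B_grid)
best_A = min(A_grid, key=lambda a: train_and_validate(a, B_fixed))

# Step 2: tune B with A fixed at the best value found in step 1.
best_B = min(B_grid, key=lambda b: train_and_validate(best_A, b))
print(best_A, best_B)
```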
This makes sense only if the other hyperparameters do not change the ordering of the validation loss with respect to the hyperparameter we want to tune. In that sense, the number of units and the number of layers cannot be tuned independently. According to Y. Bengio's paper (link), at some point the mini-batch size can be tuned independently (page 9, right column, "The Mini-Batch Size").
But what about the other ones? Learning rate, activation function, dropout, … which of them can be tuned independently?
Your question is indeed a little broad, but I will try to give you an overview of why this matters and of some specific points.
Indeed, many hyper-parameters exist in the context of deep learning. At the same time, as Andrew Ng mentions in his courses, some are of greater importance than others.
For instance, if you see that your training progresses very slowly (i.e. your convergence is relatively slow), you may want to fine-tune your learning rate.
The learning rate is a quintessential example of a hyper-parameter that matters more than, say, the number of neurons in a fully connected layer or changing the dropout rate of a layer from 0.3 to 0.5.
At the same time, there are two well-known techniques for hyper-parameter search: grid search and random search. While the former behaves much like what you describe (keeping the values of N−1 hyper-parameters fixed while iterating over specific values of the N-th), random search has proven to have a greater positive impact: it re-draws all your hyper-parameters at each search step, and although it may not be intuitive at first sight, this can yield better results sooner than grid search.
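To illustrate the contrast, here is a hedged sketch using scikit-learn's GridSearchCV and RandomizedSearchCV on a small MLP; the dataset, parameter ranges, and search budgets are arbitrary choices made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Toy dataset standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Grid search: every combination of the listed values is evaluated,
# so each hyper-parameter only ever takes a couple of distinct values.
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"learning_rate_init": [1e-3, 1e-2],
                "alpha": [1e-4, 1e-2]},
    cv=3,
)
grid.fit(X, y)

# Random search: every trial draws fresh values for all hyper-parameters,
# so the same budget explores more distinct values of each one.
rand = RandomizedSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_distributions={"learning_rate_init": loguniform(1e-4, 1e-1),
                         "alpha": loguniform(1e-5, 1e-1)},
    n_iter=4,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

With the same number of model fits (four parameter settings each), the random search tries four distinct values of every hyper-parameter instead of two, which is exactly why it tends to locate good regions of the search space faster.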
Answered by Timbus Calin on December 3, 2020
As per the paper, the author has concluded:

"the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone. The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence"
So we can't say it is a set practice that we can keep everything fixed and tune only the learning rate first, then keep the learning rate fixed and tune the weights. It doesn't even seem reasonable, knowing how gradient descent computes the errors and how the weights are updated.
By this:

"the mini-batch size can be tuned independently."

it seems he meant that we can tune the remaining parameters with the mini-batch size fixed, take them as guidelines, and tune even further. Hope it helps.
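Read that way, a rough sketch might look like the following; the toy data, the model, the fixed batch size, and the sampled ranges are all assumptions made for illustration, not values from the paper.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Toy data standing in for a real dataset (assumption for this sketch).
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

def build_model(lr, dropout):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy")
    return model

BATCH_SIZE = 64  # fixed once, per the "tuned independently" reading

results = []
for _ in range(5):  # small random search over the remaining hyper-parameters
    lr = 10 ** rng.uniform(-4, -2)
    dropout = rng.uniform(0.0, 0.5)
    history = build_model(lr, dropout).fit(
        X, y, batch_size=BATCH_SIZE, epochs=5,
        validation_split=0.2, verbose=0)
    results.append((min(history.history["val_loss"]), lr, dropout))

# Best (val_loss, lr, dropout) so far, used as a guideline for further tuning.
print(sorted(results)[0])
```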
On another note, I don't see any tables or comparisons in the paper showing results for the proposed techniques. Maybe you should contact the author to clarify the interpretation.
Answered by BlackCurrant on December 3, 2020