
Learning rate Scheduler

Data Science Asked on September 18, 2020

A very important aspect of deep learning is the learning rate. Can someone tell me how to initialize the learning rate and how to choose the decay rate? I'm sure there are valuable pointers that experienced people in this community can share with others. I've noticed that many choose to write a custom scheduler rather than use the available ones.

Can someone also tell me why the learning rate is changed during training and what influences that change? And when should a learning rate be described as small, medium or large? I want to understand it well enough to actually make sound choices. Thank you, kind souls. I appreciate this community very much.

One Answer

Finding an optimal learning rate is an important step in optimizing a neural network. As discussed at length here, it is not a trivial question, but there are some ways to get a good starting value. The main idea is to plot the loss against the learning rate for a range of values and choose a learning rate in the region where the loss decreases most steeply:

[Figure from the linked article: training loss plotted against learning rate.]
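A curve like this can be produced with a short learning rate range test: train for a limited number of batches while exponentially increasing the learning rate, recording the loss at each step. Below is a minimal sketch assuming tf.keras; the start/end rates and the number of steps are assumptions you would tune for your own problem.

```python
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Increase the learning rate exponentially each batch and record the loss.

    Plotting `losses` against `lrs` afterwards reproduces the kind of
    loss-vs-learning-rate curve shown above. The default rates and step
    count are illustrative assumptions.
    """

    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=200):
        super().__init__()
        self.start_lr, self.end_lr, self.num_steps = start_lr, end_lr, num_steps
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        self.step = 0
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        self.lrs.append(float(tf.keras.backend.get_value(self.model.optimizer.learning_rate)))
        self.losses.append(logs["loss"])
        self.step += 1
        # Grow the rate geometrically from start_lr to end_lr over num_steps batches.
        new_lr = self.start_lr * (self.end_lr / self.start_lr) ** (self.step / self.num_steps)
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, new_lr)
        if self.step >= self.num_steps:
            self.model.stop_training = True

# Usage (sketch):
# lr_test = LRRangeTest()
# model.fit(x_train, y_train, epochs=1, callbacks=[lr_test])
# plt.semilogx(lr_test.lrs, lr_test.losses)
```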

That figure, taken from the article linked above, tells me that anything from a little to the right of 10^-5 up to around 10^-3 can be a good learning rate. One can also set a relatively high learning rate and reduce it once the loss reaches a plateau; in this example, you would be better off starting at the high end of that range and lowering the rate later as needed. This can be achieved with a learning rate scheduler (such as the one available through Keras callbacks). That way, you won't spend a lot of time in the initial epochs, where there is a lot to learn and the loss drops quickly. Three key parameters of such a scheduler are "factor", "patience" and "min delta": if the loss does not improve by at least "min delta" within "patience" epochs, the learning rate is multiplied by "factor".
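In Keras, those three parameters correspond to the ReduceLROnPlateau callback. A minimal sketch, assuming tf.keras and with purely illustrative values:

```python
import tensorflow as tf

# Cut the learning rate to one fifth whenever the validation loss has not
# improved by at least 1e-4 for 5 consecutive epochs (illustrative values).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # quantity watched for a plateau
    factor=0.2,          # new_lr = old_lr * factor
    patience=5,          # epochs without improvement before reducing
    min_delta=1e-4,      # smallest change that still counts as an improvement
    min_lr=1e-6,         # floor for the learning rate
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```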

You also mention decay in your post, which can be thought of as the learning rate adjustment built into some optimizers by default. If I remember correctly, adaptive optimizers such as Adam effectively apply a different learning rate to each parameter and reduce them at different rates; you can find a more detailed comparison here. The main reason to use an LR scheduler even when such decay is present is to have more control over the learning process. When training reaches a plateau, you might have to wait a long time before the default decay in Adam (for example) brings the rate low enough to get past that region and start learning again. With a scheduler, you decide when to lower the learning rate and by how much (typically a far larger drop than the default decay provides), so that further learning is possible. Keep in mind that there is not much left to learn in the later epochs compared to the early ones.
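To make the contrast concrete, here is a minimal tf.keras sketch of the two options: a smooth decay schedule attached to the optimizer versus a fixed rate that is only dropped, by a large factor, when the loss actually plateaus. All values are illustrative assumptions:

```python
import tensorflow as tf

# Option 1: a built-in decay schedule attached to the optimizer.
# The learning rate shrinks gradually regardless of how the loss behaves.
decay_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # assumed starting point, e.g. from the figure above
    decay_steps=10_000,
    decay_rate=0.9,
)
opt_with_decay = tf.keras.optimizers.Adam(learning_rate=decay_schedule)

# Option 2: a fixed rate plus an explicit scheduler callback.
# The rate only drops (by a large factor) when the loss actually plateaus,
# which gives more direct control over when further learning happens.
opt_fixed = tf.keras.optimizers.Adam(learning_rate=1e-3)
on_plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=5
)
```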

Answered by serali on September 18, 2020
