
Why not use linear regression to fine-tune the last layer of a neural network?

Data Science Asked on May 2, 2021

In transfer learning, often only the last layer of the network is retrained using gradient descent.
However, the last layer of a common neural network performs only a linear transformation, so why do we use gradient descent rather than linear (or logistic) regression to fine-tune the last layer?

One Answer

The common approach to fine-tuning an existing pre-trained neural network is the following:

  1. Given an existing pre-trained neural network model (e.g. one pre-trained on ImageNet), remove the last layer (which performs classification for the pre-training task) and freeze all weights in the remaining layers of the model (usually by setting the trainable parameter to false).
  2. Add a new final dense layer that is to be trained on the new task.
  3. Train the model on the new task's dataset until convergence.
  4. [optional] After fine-tuning has converged, unfreeze all the layers and train further with a lower learning rate until convergence (see the sketch after this list).

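A minimal sketch of these steps in Keras, assuming an ImageNet-pretrained ResNet50 as the base model and a 10-class target task; the dataset objects (train_ds, val_ds) are placeholders:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    # 1. Load a pre-trained model without its classification head and freeze it.
    base = keras.applications.ResNet50(weights="imagenet",
                                       include_top=False, pooling="avg")
    base.trainable = False

    # 2. Add a new dense layer for the new task (assumed here: 10 classes).
    inputs = keras.Input(shape=(224, 224, 3))
    x = base(inputs, training=False)  # keep batch-norm layers in inference mode
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)

    # 3. Train only the new head until convergence.
    model.compile(optimizer=keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=10)

    # 4. (Optional) Unfreeze the base and continue with a much lower learning rate.
    base.trainable = True
    model.compile(optimizer=keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=5)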
A reason to use gradient descent rather than a different ML algorithm, as you suggest, is that it enables further training after the initial fine-tuning (step #4 above). However, this is not required. The approach you suggest (using the output of the pre-trained model as input to another ML model) may provide satisfactory performance and be more computationally efficient.
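A sketch of that alternative, assuming the same ResNet50 feature extractor and scikit-learn's LogisticRegression; the arrays (x_train, y_train, x_test, y_test) are placeholders:

    from tensorflow import keras
    from sklearn.linear_model import LogisticRegression

    # Use the frozen pre-trained network purely as a feature extractor.
    base = keras.applications.ResNet50(weights="imagenet",
                                       include_top=False, pooling="avg")
    train_features = base.predict(x_train)
    test_features = base.predict(x_test)

    # Fit an ordinary (convex) logistic regression on the extracted features,
    # with no gradient descent through the base network.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_features, y_train)
    print(clf.score(test_features, y_test))

Because the features are computed once and the classifier is small, this can be much cheaper than backpropagating through the whole network, at the cost of giving up the option to later fine-tune the base model end to end.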

Tradeoffs between these approaches are also discussed in the Keras Transfer Learning guide, in the section on the "Typical Transfer Learning Workflow".

Answered by grov on May 2, 2021
