Why are results without Transfer Learning better than with Transfer Learning?

Question

I developed a neural network for license plate recognition and used the EfficientNet architecture (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) with and without pretrained weights on ImageNet and with and without data augmentation. I only had 10.000 training images and 3.000 validation images. That was the reason I applied Transfer learning and image augmentation (AdditiveGaussianNoise).
I created this model:
efnB0_model = efn.EfficientNetB0(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
efnB0_model.trainable = False

def create_model(input_shape = (224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model (input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

branches = []
    for i in range(7):
            branches.append(backbone)
            branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_360")(branches[i])
            branches[i] = BatchNormalization()(branches[i])
            branches[i] = Activation("relu") (branches[i])
            branches[i] = Dropout(0.2)(branches[i])                        
            branches[i] = Dense(35, activation = "softmax", name="branch_"+str(i)+"_output")(branches[i])
        
    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)

return model

I compiled the model:
opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=["accuracy"])

And used this code to fit it:
hist = model.fit(
    x=training_generator, epochs=10, verbose=1, callbacks=None,
    validation_data=validation_generator, steps_per_epoch=num_train_samples // 16,
    validation_steps=num_val_samples // 16,
    max_queue_size=10, workers=6, use_multiprocessing=True)

My hypotheses were:
H1: The EfficientNet architecture is applicable to license plate recognition.
H2: Transfer learning will improve accuracy in license plate recognition (compared to the situation without Transfer Learning).
H3: Image augmentation will improve accuracy in license plate recognition (compared to the situation without it).
H4: Transfer Learning combined with Image augmentation will bring the best results.
I now got this results:

So, H1 seems to be correct. But H2, H3 and H4 seem to wrong.
I was thinking about it and got an explanation for H3 and H4, which seem to be logical for me. That is, that image augmentation is too heavy and deteriorates the quality of images by a degree which makes it very hard for the network to recognize the characters.
1. Is this a suitable explanation and are there other ones additionally?
It seems to be the case, that image augmentation was too strong. So, first question is solved.
Regarding H2 I am little confued to be honest. The network seems to overfit but stagnates completely regarding validation accuracy. So, the conclusion that the Imagenet weights are not applicable seems not logical to me because the network learnt something for the training data. I also excluded the possibility that the data volume is to small since we had that good recognition rates without using Transfer learning or image augmentation...
2. Is there any logical explanation for this?

etiennedm · Accepted Answer

As @fuwiak mentioned, transfer learning may not work if pre-trained model has been fitted on a "very different" dataset. Typically if the pre-trained network extract information that is not relevant for your problem.
Moreover, in the paper License Plate Recognition System Based on Transfer Learning (that you shared with me), they have tried to freeze some layers of a pretrained Xception (based on ImageNet weights) to see the impact on the training. They conclude that ImageNet data and license plate data are too different to freeze layers. So your results are confirmed.
Now changing efnB0_model.trainable = False to True would allow the pre-trained network to update and to be more relevant to your problem. Generally, if you don't have time issues, it seems to be always better (see this post). Will it give better results than initialize the weights randomly ? I think one can presume but cannot know.

fuwiak · Answer

At least two issues:
Negative transfer

Transfer learning working if the initial and our problem are similar. Unfortunately, we think that there are similar enough, but its just illusion.

Data greedy

Often model start working well, if we provide much more data.

Why are results without Transfer Learning better than with Transfer Learning?

2 Answers

Add your own answers!

Ask a Question