
ConvNet - What to improve regarding architecture, procedure and technique?


I have a dataset of 180k images of license plates (so it is not necessary to localize the plate first), and I am trying to recognize the characters on the images (license plate recognition). All of the plates contain exactly seven characters, and 35 characters are possible, so the output vector y has shape (7, 35). I therefore one-hot encoded every license plate label.
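To make the encoding concrete, this is a minimal sketch of how such a (7, 35) target can be built (the alphabet here is a placeholder, not my actual character set):

import string
import numpy as np

# Hypothetical 35-character alphabet: digits plus A-Z without 'O' (my real set may differ).
ALPHABET = string.digits + string.ascii_uppercase.replace("O", "")
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def encode_plate(plate):
    """One-hot encode a 7-character plate string into a (7, 35) target array."""
    y = np.zeros((7, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(plate):
        y[pos, CHAR_TO_IDX[char]] = 1.0
    return y

# Example: encode_plate("AB123CD").shape == (7, 35)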

I use the bottom of the EfficientNet-B0 model (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) together with a customized top, which is split into seven branches (one per license plate character). I use the ImageNet weights and froze the layers of efnB0_model:

efnB0_model = efn.EfficientNetB0(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
efnB0_model.trainable = False

I am using transfer learning because the authors of this paper did the same and achieved very good accuracy: https://link.springer.com/chapter/10.1007/978-981-13-1733-0_6

The top of the model is constructed similarly to some works in the literature that follow a segmentation-free approach (so a separate segmentation step is not necessary). These are the two papers I used as starting points for the top of my model: https://ieeexplore.ieee.org/abstract/document/8078501 & https://dl.acm.org/doi/abs/10.1145/3009977.3010052

# Layer imports (from keras or tensorflow.keras, depending on the setup)
from keras.layers import (Input, GlobalAveragePooling2D, Dropout, Dense,
                          Activation, BatchNormalization, Concatenate, Reshape)
from keras.models import Model

def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    #model = GlobalMaxPooling2D()(model)
    model = Dropout(0.2)(model)
    #backbone = Flatten()(model)
    backbone = model

    # One branch per character position, all fed from the same pooled backbone features
    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_360")(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        #branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        # branches[i] = Dense(128, name="branch_"+str(i)+"_Dense_128")(branches[i])
        # branches[i] = BatchNormalization()(branches[i])
        # branches[i] = Activation("relu")(branches[i])
        # branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax", name="branch_"+str(i)+"_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)

    return model

model = create_model()

opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=["accuracy"])

For training and validating the model I only use 10,000 training images and 3,000 validation images, because of the technical constraints in Colab and the sheer amount of data, which would make training very slow (one epoch takes about 45 minutes).

I use this DataGenerator to feed batches to my model; I extend it when I add Gaussian noise or snowflakes, but I will not paste that code here for readability:

import math
import numpy as np
from keras.utils import Sequence
# imread and resize come from the image library in use (e.g. skimage.io / skimage.transform)

class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)

        return batch_x, batch_y
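For illustration, this is roughly what the Gaussian-noise extension looks like (a minimal sketch, not my exact augmentation code; the class name and sigma value are placeholders):

class NoisyDataGenerator(DataGenerator):
    """Variant of the generator above that adds Gaussian noise to each batch (sketch only)."""

    def __init__(self, x_set, y_set, batch_size, sigma=0.05):
        super().__init__(x_set, y_set, batch_size)
        self.sigma = sigma  # placeholder noise level, not a tuned value

    def __getitem__(self, idx):
        batch_x, batch_y = super().__getitem__(idx)
        noise = np.random.normal(loc=0.0, scale=self.sigma, size=batch_x.shape)
        batch_x = np.clip(batch_x + noise, 0.0, 1.0)
        return batch_x, batch_y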

I fit the model using this code:

model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch = num_train_samples // 16,
                    validation_steps = num_val_samples // 16,
                    epochs = 10, workers=6, use_multiprocessing=True)
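Here training_generator and validation_generator are instances of the DataGenerator above, roughly like this (the variable names for the image paths and labels are placeholders):

# Illustrative construction of the generators; x_train / y_train etc. stand for
# my lists of image paths and one-hot encoded labels.
training_generator = DataGenerator(x_train, y_train, batch_size=16)
validation_generator = DataGenerator(x_val, y_val, batch_size=16)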

These are my results after trying different approaches:

  • When I apply transfer learning to the model shown above, I get overfitting: a large gap between training and validation accuracy (training accuracy keeps increasing while validation accuracy plateaus at around 0.18x).
  • When I only use the EfficientNet architecture + customized top without any other technique, both training and validation accuracy stay at around 0.18x.
  • When I add Gaussian noise or snowflakes, training and validation accuracy also stay around 0.18x.
  • When I use a simpler model, I also get values around 0.18x.

So I am wondering what I should check and what I can improve in my model. Do you think there is something fundamentally wrong with my architecture?

I would say four causes are possible:

  • Overfitting: the data volume is too low and the model too complex.
  • Preprocessing: either faulty (though I do not think so) or inadequate (e.g. the rescaling of the images distorts the characters).
  • Data quality: different data than in other approaches and lower quality than in some of the papers.
  • The architecture is not adequate: the branches do not get the positional information they need.

What do you think I should check first?
