
Multi-class neural net always predicting 1 class after optimization

Data Science Asked on December 14, 2021

During training, the neural net settles into a place where it always predicts 1 of the 5 classes.

My train and test sets are distributed as such:

Train Set
Samples: 269,501. Features: 157
Data distribution
16.24% 'a'
39.93% 'b'
9.31%  'c'
20.86% 'd'
13.67% 'e'

Test Set
Samples: 33,967. Features: 157
Data distribution
10.83% 'a'
35.39% 'b'
19.86% 'c'
16.25% 'd'
17.66% 'e'

Note the percentages of class b!

I am training an MLP with dropout. Both the training and test (i.e. validation) accuracies plateau at values that exactly match the proportion of one class in the respective set; in other words, the network is learning to always predict one class out of the five. I've verified the classifier is always predicting b.

I've tried batch sizes of 0.25 and 1.0 and made doubly sure the data was shuffled. I tried both the SGD and Adam optimizers, with and without decay and with different learning rates, and still got the same result. I tried dropout of 0.2 and 0.5, and EarlyStopping with a patience of 300 epochs.

Every so often the model will pop out of the plateau in training and validation accuracy during training, but then validation accuracy always goes down while training accuracy goes up; in other words, it overfits.

Output, cut off after 6 epochs. It doesn’t always converge this fast, just with this particular SGD optimizer:

Epoch 1/2000
Epoch 00000: val_acc improved from -inf to 0.35387, saving model to /home/user/src/thing/models/weights.hdf
269501/269501 [==============================] - 0s - loss: 1.6094 - acc: 0.1792 - val_loss: 1.6073 - val_acc: 0.3539
Epoch 2/2000
Epoch 00001: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.6060 - acc: 0.3993 - val_loss: 1.6042 - val_acc: 0.3539
Epoch 3/2000
Epoch 00002: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.6002 - acc: 0.3993 - val_loss: 1.6005 - val_acc: 0.3539
Epoch 4/2000
Epoch 00003: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.5930 - acc: 0.3993 - val_loss: 1.5967 - val_acc: 0.3539
Epoch 5/2000
Epoch 00004: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.5851 - acc: 0.3993 - val_loss: 1.5930 - val_acc: 0.3539
Epoch 6/2000

Code:
Model creation:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
from keras.constraints import maxnorm


def create_mlp(input_dim, output_dim, dropout=0.5, arch=None):
    """Setup neural network model (keras.models.Sequential)"""
    # default mlp architecture
    arch = arch if arch else [64,32,32,16]

    # setup densely connected NN architecture (MLP)
    model = Sequential()
    model.add(Dropout(dropout, input_shape=(input_dim,)))
    for output in arch:
        model.add(Dense(output, activation='relu', W_constraint=maxnorm(3)))
        model.add(Dropout(dropout))
    model.add(Dense(output_dim, activation='sigmoid'))

    # compile model and save architecture to disk
    sgd = SGD(lr=0.01, momentum=0.9, decay=0.0001, nesterov=True)
    # adam = Adam(lr=0.001, decay=0.0001)
    model.compile(loss='categorical_crossentropy',
                  optimizer=sgd,
                  metrics=['accuracy'])
    return model

And inside main after some preprocessing:

    # labels must be one-hot encoded for loss='categorical_crossentropy'
    # meaning, of possible labels 0,1,2: 0->[1,0,0]; 1->[0,1,0]; 2->[0,0,1]
    y_train_onehot = to_categorical(y_train, n_classes)
    y_test_onehot = to_categorical(y_test, n_classes)

    # get neural network architecture and save to disk
    model = create_mlp(input_dim=train_dim, output_dim=n_classes)
    with open(clf_file(typ='arch'), 'w') as f:
        f.write(model.to_yaml())

    # output logs to tensorflow TensorBoard
    # NOTE: don't use param histogram_freqs until keras issue fixed
    #       https://github.com/fchollet/keras/pull/5175
    tensorboard = TensorBoard(log_dir=opts.tf_dir)

    # only save model weights for best performing model
    checkpoint = ModelCheckpoint(clf_file(typ='weights'),
                                 monitor='val_acc',
                                 verbose=1,
                                 save_best_only=True)

    # stop training early if validation accuracy doesn't improve for long enough
    early_stopping = EarlyStopping(monitor='val_acc', patience=300)

    # shuffle data for good measure before fitting
    x_train, y_train_onehot = shuffle(x_train, y_train_onehot)

    np.random.seed(seed)
    model.fit(x_train, y_train_onehot,
              nb_epoch=opts.epochs,
              batch_size=train_batch_size,
              shuffle=True,
              callbacks=[tensorboard, checkpoint, early_stopping],
              validation_data=(x_test,y_test_onehot))

3 Answers

My guess is that the data you provide does not hold enough information to predict $a, b, c, d$ or $e$. Because $b$ is over-represented in the dataset, the model will always predict $b$, since that's the safest bet. If you didn't know anything about the input, or couldn't extract any useful information from it, you would probably also always predict $b$, simply because it's the most likely class when picking a random sample.

To fix this, you either need to get better data that holds more information, or balance your dataset (if your task allows it) so that all labels appear equally often. A balancing sketch follows below.
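If you go the balancing route, one simple option is to oversample the minority classes before fitting. A minimal sketch using scikit-learn, assuming x_train is a NumPy feature matrix and y_train holds the integer labels as in the question's code (oversample_to_balance is just an illustrative helper name):

import numpy as np
from sklearn.utils import resample, shuffle

def oversample_to_balance(x, y, random_state=0):
    """Oversample every class up to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    xs, ys = [], []
    for cls in classes:
        mask = (y == cls)
        x_cls, y_cls = resample(x[mask], y[mask],
                                replace=True,
                                n_samples=n_max,
                                random_state=random_state)
        xs.append(x_cls)
        ys.append(y_cls)
    # shuffle so the oversampled blocks aren't contiguous
    return shuffle(np.vstack(xs), np.concatenate(ys),
                   random_state=random_state)

Keep in mind this only changes the class priors the network sees; if the features carry no signal, a balanced model will simply predict all classes at chance instead of always predicting $b$.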

Answered by Evator on December 14, 2021

You learn a lot by comparing to a naive model. A naive model is one without any features; by default, it always predicts the most likely target. Note that this is exactly what your model is doing, which indicates the features are not helping with the prediction. Have you done a basic distribution analysis to see which features actually affect the distribution of the target? This is where I'd start.
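As a quick sketch (assuming y_train and y_test are arrays of integer labels, as in the question), the naive baseline is just a few lines of NumPy:

import numpy as np

# Majority-class baseline: always predict the most frequent training label.
values, counts = np.unique(y_train, return_counts=True)
majority_class = values[np.argmax(counts)]
baseline_acc = np.mean(y_test == majority_class)
print("Always predicting class %s gives accuracy %.4f"
      % (majority_class, baseline_acc))
# For the distributions in the question this is ~0.3539 (class 'b'),
# which is exactly where val_acc plateaus in the training log.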

Answered by Paul on December 14, 2021

It could be a bug in your code, problems with your training set (maybe you don't have the file format quite right), or some other implementation issue.

Are you sure you want to use a sigmoid activation function in your last layer? I would have expected that the normal approach would be to use a softmax as the last layer (so that you can treat the outputs as the probability of each class, i.e., so that they're normalized to sum to 1). You might try that.
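If you try that, the change in the question's create_mlp is a one-line edit to the output layer:

# Sketch: swap the final sigmoid for a softmax so the five outputs are
# normalized to sum to 1 and can be read as class probabilities, which
# pairs naturally with loss='categorical_crossentropy'.
model.add(Dense(output_dim, activation='softmax'))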

Alternatively, this might be a 'class imbalance' problem. Do some searching on that term and you'll find a bunch of standard methods for dealing with it: you can balance the training set, use 'weights' on the instances, or adjust the decision threshold based on the priors. However, as others have pointed out, the imbalance here is not severe enough that I would have expected it to cause this strong a bias.
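One common recipe for the instance-weight approach, sketched against the question's fit call (the weighting formula is the usual 'balanced' heuristic, inversely proportional to class frequency; other fit arguments such as the callbacks stay as they are):

import numpy as np

# Per-class weights: rarer classes get proportionally larger weights.
values, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): len(y_train) / (len(values) * n)
                for c, n in zip(values, counts)}

model.fit(x_train, y_train_onehot,
          nb_epoch=opts.epochs,
          batch_size=train_batch_size,
          class_weight=class_weight,
          validation_data=(x_test, y_test_onehot))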

It's also possible that your features are useless and don't help predict the output (e.g., they are not related to or correlated to the output). That would also be consistent with what you are seeing.


Side note: My understanding is that the Adam optimizer generally is more effective than plain SGD, though I don't see any reason to expect that to be the issue here.

Answered by D.W. on December 14, 2021
