Keras: Prediction performance does not match accuracy

Data Science · Asked on December 9, 2021

I am using a Keras CNN to identify plankton images collected with an in situ camera. When I make confusion matrices on the test set after training, the accuracy computed from the predictions is far worse than the accuracy reported during training.

I have a large number of files and have been using flow_from_directory and generators. I suspect that something is going wrong with the indexing of the predictions (e.g. this post), but as near as I can tell the indexing of the filenames/labels matches up.

I worked up a quick example that is similar to what I am doing with the mnist_png dataset:

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.layers.advanced_activations import LeakyReLU
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 28, 28

train_data_dir = 'S:/mnist_png/training'

num_epochs = 100
batch_size = 128
num_test_samples=10000

if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))

model.add(Conv2D(16, (3, 3)))
model.add(Activation('relu'))

model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.5))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))

model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.5))

model.add(Flatten())
model.add(Dense(512,activation='linear'))
model.add(LeakyReLU(alpha=.3))
model.add(Dropout(0.5))

model.add(Dense(512,activation='linear'))
model.add(LeakyReLU(alpha=.3))

model.add(Dense(10))
model.add(Activation('softmax'))


model.compile(loss='categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

# augmentation for training
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=90,
    validation_split=0.1)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical')

model.fit_generator(
    train_generator,
    steps_per_epoch=batch_size,
    epochs=num_epochs)

…after 100 epochs I’m getting loss: 0.7517 - acc: 0.7482.

I then evaluate the test set as follows:

test_data_dir = 'S:/mnist_png/testing'

test_datagen = ImageDataGenerator(
    rescale=1. / 255)

test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle='False',
    class_mode='categorical')

#Evaluate model on test set
scores = model.evaluate_generator(test_generator,workers=12)

…the scores for that were 0.6184 (loss) and 0.8168 (accuracy), so in the same ballpark as training.
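
As a sanity check on which number is which: the ordering of scores follows the model's metric names. A minimal check, using the model compiled above:

# scores follows the order of model.metrics_names, i.e. [loss, acc] here
print(dict(zip(model.metrics_names, scores)))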

But it gets weird when I look at the predictions, e.g.:

test_generator.reset()  # necessary to force it to start from the beginning
Y_pred = model.predict_generator(test_generator)
y_pred = np.argmax(Y_pred, axis=-1)
sum(y_pred==test_generator.classes)/10000

The proportion of predictions that are actually correct (calculated in the last line) is around 0.1; when I look at a confusion matrix it is all over the place, and the diagonal is mostly zeros.
I have verified that test_generator.classes match up with the directories in test_generator.filenames, and shuffle is off. Per this post, calling test_generator.reset() should force it to take the files in order, but I’m wondering if it does not.
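
For completeness, here is a minimal sketch of that alignment check and the confusion matrix, assuming scikit-learn is available (variable names follow the code above):

import os
from sklearn.metrics import confusion_matrix

# class_indices maps subdirectory name -> integer label
label_for_dir = test_generator.class_indices

# each file's parent directory should agree with its assigned class
parent_dirs = [os.path.split(f)[0] for f in test_generator.filenames]
assert all(label_for_dir[d] == c
           for d, c in zip(parent_dirs, test_generator.classes))

# rows = true classes, columns = predicted classes
print(confusion_matrix(test_generator.classes, y_pred))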

Does anyone have any thoughts on why this is happening or further steps to troubleshoot it?

3 Answers

I suppose you want something like this:

# shuffle=False keeps batch order aligned with image_generator.classes;
# rescale matches the preprocessing used during training
image_generator = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'test_data_path', target_size=(224, 224), shuffle=False)
true_labels = image_generator.classes
pred_probs = model.predict(image_generator)
preds = pred_probs.argmax(axis=-1)  # 1-D array of predicted class indices
print(sum(preds == true_labels) / len(true_labels))
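
With shuffle=False the generator yields batches in the order of image_generator.filenames, so the argmax predictions line up index-for-index with image_generator.classes; with shuffling on, that comparison is meaningless.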

Answered by malelis on December 9, 2021

Setting shuffle=False on the generator passed to evaluate_generator and predict_generator fixed the issue for me.
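
Note that it must be the boolean False rather than the string 'False': a non-empty string is truthy in Python, so the question's shuffle='False' leaves shuffling enabled. A minimal sketch of the corrected generator from the question:

test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=False,  # the boolean, not the string 'False'
    class_mode='categorical')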

Answered by Artimizia Dias on December 9, 2021

It was indeed the indexing; the answer is here.

Answered by Rob Campbell on December 9, 2021
