Data Science Asked on June 14, 2021
PLEASE NOTE: I am not trying to improve on the following example. I know you can get over 99% accuracy. The whole code is in the question. When I run this simple code I get around 95% accuracy; if I simply change the activation function from sigmoid to ReLU, it drops to less than 50%. Is there a theoretical reason why this happens?
I have found the following example online:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.utils import np_utils
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
Y_train = np_utils.to_categorical(Y_train, 10)
Y_test = np_utils.to_categorical(Y_test, 10)
batch_size = 100
epochs = 15
model = Sequential()
model.add(Dense(100, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='sgd')
model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1)
score = model.evaluate(X_test, Y_test, verbose=1)
print('Test accuracy:', score[1])
This gives about 95% accuracy, but if I replace the sigmoid with ReLU, I get less than 50% accuracy. Why is that?
I took your exact code, replaced
model.add(Activation('sigmoid'))
by
model.add(Activation('relu'))
and indeed I experienced the same problem as you: only 55% accuracy, which is bad...
Solution: I rescaled the input image values from [0, 255] to [0, 1] and it worked: 93% accuracy with ReLU! (inspired by here):
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.utils import np_utils
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
# Rescale pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
Y_train = np_utils.to_categorical(Y_train, 10)
Y_test = np_utils.to_categorical(Y_test, 10)
batch_size = 100
epochs = 15
model = Sequential()
model.add(Dense(100, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='sgd')
model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1)
score = model.evaluate(X_test, Y_test, verbose=1)
print('Test accuracy:', score[1])
Output:
Test accuracy: 0.934
Potential explanation: when the input is in [0, 255], the weighted sum for layer $L$, $z = a^{(L-1)} w^{(L)} + b^{(L)}$, will often be large too. If $z$ is often large (or even just often > 0), say around 100, then $ReLU(z) = z$, and we completely lose the "non-linear" aspect of this activation function! Said another way: if the input is in [0, 255], then $z$ is often far from 0, and we completely avoid the region where the "interesting non-linear things" happen (around 0 the ReLU function is non-linear and looks like __/ )... Now when the input is in [0, 1], the weighted sum $z$ can often be close to 0: it sometimes goes below 0 (the weights are randomly initialized in a small symmetric range around 0, so this is possible), sometimes above 0, etc. Then much more neuron activation/deactivation is happening... This could be a potential explanation of why it works better with input in [0, 1].
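As a quick numerical check of this intuition (a minimal sketch, not part of the original answer; it assumes Keras's default Glorot-uniform initialization for a 784 -> 100 Dense layer and uses a random input vector rather than a real MNIST image):

import numpy as np

rng = np.random.default_rng(0)

# Glorot-uniform limit for a 784 -> 100 Dense layer (Keras's default kernel initializer)
limit = np.sqrt(6.0 / (784 + 100))
W = rng.uniform(-limit, limit, size=(784, 100))

x_raw = rng.integers(0, 256, size=784).astype('float32')  # pixels in [0, 255]
x_scaled = x_raw / 255.0                                   # pixels in [0, 1]

z_raw = x_raw @ W        # pre-activations with unscaled input
z_scaled = x_scaled @ W  # pre-activations with scaled input

print('mean |z|, input in [0, 255]:', np.abs(z_raw).mean())
print('mean |z|, input in [0, 1]:  ', np.abs(z_scaled).mean())

With the unscaled input the pre-activations typically land in the hundreds, where ReLU behaves almost everywhere like the identity; with the scaled input they stay around 0, where the non-linearity actually matters.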
Correct answer by Basj on June 14, 2021
Because with MNIST, you are trying to predict based on probabilities.
The sigmoid function squishes the $x$ value between $0$ and $1$. This helps to pick the most probable digit that matches the label.
The ReLU function doesn't squish anything. If the $x$ value is less than $0$, the output is $0$; if it's more than $0$, the output is the $x$ value itself. No probabilities are being created.
Honestly, I'm surprised you got anything more than 10% when you plugged it in.
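For reference, a minimal sketch of the squashing behaviour described above (not part of the original answer):

import numpy as np

z = np.array([-3.0, 0.0, 2.0, 100.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))  # every value is squashed into (0, 1)
relu = np.maximum(0.0, z)           # negatives become 0, positives pass through unchanged

print(sigmoid)  # approximately [0.047, 0.5, 0.881, 1.0]
print(relu)     # [0., 0., 2., 100.]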
Answered by daleadil on June 14, 2021
I got around 98% accuracy using the ReLU activation function. I used the following architecture:
I think you should add output clipping and then train it; hopefully that will work fine.
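If by "output clipping" this answer means capping the ReLU activations (an assumption on my part), one way to sketch that in Keras is the ReLU layer's max_value argument:

from keras.models import Sequential
from keras.layers import Dense, ReLU, Activation

# Same small network as above, but with the hidden activations clipped at 6 ("ReLU6");
# the threshold 6.0 is an arbitrary illustrative choice, not taken from the answer.
model = Sequential()
model.add(Dense(100, input_dim=784))
model.add(ReLU(max_value=6.0))
model.add(Dense(10))
model.add(Activation('softmax'))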
Answered by Yash Khare on June 14, 2021