
Why does the error of my LSTM not decrease after 10 epochs?

Asked by K. Do on December 13, 2020

Despite the problem being very simple, I was wondering why an LSTM network was not able to converge to a decent solution.

import numpy as np
import keras

# 1000 random scalars; the target is simply the input itself (an identity mapping)
X_train = np.random.rand(1000)
y_train = X_train
# reshape to (batch_size, timesteps, input_dim) = (1000, 1, 1)
X_train = X_train.reshape((len(X_train), 1, 1))

model = keras.models.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(1, dropout=0., recurrent_dropout=0.)))
model.add(keras.layers.Dense(1))

optimizer = keras.optimizers.SGD(lr=1e-1)

model.build(input_shape=(None, 1, 1))
model.compile(loss=keras.losses.mean_squared_error, optimizer=optimizer, metrics=['mae'])
history = model.fit(X_train, y_train, batch_size=16, epochs=100)

After 10 epochs, the loss plateaus at around 1e-4 RMSE and the network is unable to improve the results any further.

A simple Flatten + Dense network with similar parameters, however, is able to achieve 1e-13 RMSE.
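For reference, such a baseline would look roughly like this (a sketch, not necessarily the exact code compared against; it reuses X_train and y_train from above):

baseline = keras.models.Sequential()
baseline.add(keras.layers.Flatten(input_shape=(1, 1)))  # flatten the (1, 1) input to a single value
baseline.add(keras.layers.Dense(1))                      # linear output layer (no activation by default)
baseline.compile(loss=keras.losses.mean_squared_error,
                 optimizer=keras.optimizers.SGD(lr=1e-1),
                 metrics=['mae'])
baseline.fit(X_train, y_train, batch_size=16, epochs=100)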

I’m surprised the LSTM cell does not simply let the value through. Is there something I’m missing in my parameters? Are LSTMs only good for classification problems?

One Answer

I think there are some problems with your approach.

Firstly, looking at the Keras documentation, LSTM expects an input of shape (batch_size, timesteps, input_dim). You're passing an input of shape (1000, 1, 1), which means your "sequences" are only 1 timestep long.
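To make the shape convention concrete, here is a small illustration of the (batch_size, timesteps, input_dim) layout (not code from the original post):

import numpy as np

# 1000 "sequences" of 1 timestep with 1 feature each, as in the question
X_short = np.random.rand(1000).reshape((1000, 1, 1))
print(X_short.shape)  # (1000, 1, 1) -> (batch_size, timesteps, input_dim)

# the same amount of data grouped into 100 sequences of 10 timesteps instead
X_long = np.random.rand(1000).reshape((100, 10, 1))
print(X_long.shape)   # (100, 10, 1)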

RNNs were proposed to capture temporal dependencies, but there are no such dependencies to capture when each series has length 1 and the values are randomly generated. If you want to create a more realistic scenario, I would suggest generating a sine wave, since it has a smooth periodic oscillation. Then increase the number of timesteps beyond 1, and you can test on the following timesteps (to make predictions).
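A minimal sketch of that suggestion, assuming an arbitrary window length, layer size and optimizer (none of these values come from the answer itself):

import numpy as np
import keras

timesteps = 20
t = np.linspace(0, 100, 2000)
wave = np.sin(t)

# slice the wave into overlapping windows; the target is the value right after each window
X, y = [], []
for i in range(len(wave) - timesteps):
    X.append(wave[i:i + timesteps])
    y.append(wave[i + timesteps])
X = np.array(X).reshape((-1, timesteps, 1))
y = np.array(y)

model = keras.models.Sequential()
model.add(keras.layers.LSTM(16, input_shape=(timesteps, 1)))
model.add(keras.layers.Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, batch_size=32, epochs=20)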

For the second part, if you think about a vanilla RNN (I will explain it for a simple RNN, but you can imagine a similar flow for an LSTM) and a Dense layer applied to a single timestep, there are not many differences. The Dense layer computes $Y=f(XW + b)$, where $X$ is the input, $W$ is the weight matrix, $b$ is the bias and $f$ is the activation function. The RNN computes $Y=f(XW_1 + W_2h_0 + b)$; since this is the first timestep, $h_0$ is $0$, so it reduces to $Y=f(XW_1 + b)$, which is identical to the Dense layer. I suspect the difference in results is caused by the activation functions: by default the Dense layer has no activation function, while the LSTM uses tanh and sigmoid.
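To illustrate the point about activations (only showing the defaults, not code from the answer):

import keras

dense = keras.layers.Dense(1)                            # default: no activation (linear)
rnn_linear = keras.layers.SimpleRNN(1, activation=None)  # linear RNN: equivalent to Dense for a single timestep
rnn_default = keras.layers.SimpleRNN(1)                  # default: tanh
lstm_default = keras.layers.LSTM(1)                      # default: tanh activation with sigmoid gates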

Answered by razvanc92 on December 13, 2020
