Artificial Intelligence Asked by K. Do on December 13, 2020
Despite the problem being very simple, I was wondering why an LSTM network is not able to converge to a decent solution.
import numpy as np
import keras

# Target is the identity function: y = x, on 1000 random scalars.
X_train = np.random.rand(1000)
y_train = X_train
# Reshape to (batch_size, timesteps, input_dim) = (1000, 1, 1).
X_train = X_train.reshape((len(X_train), 1, 1))

model = keras.models.Sequential()
model.add(keras.layers.wrappers.Bidirectional(keras.layers.LSTM(1, dropout=0., recurrent_dropout=0.)))
model.add(keras.layers.Dense(1))
optimizer = keras.optimizers.SGD(lr=1e-1)
model.build(input_shape=(None, 1, 1))
model.compile(loss=keras.losses.mean_squared_error, optimizer=optimizer, metrics=['mae'])
history = model.fit(X_train, y_train, batch_size=16, epochs=100)
After 10 epochs, the model seems to have reached its optimal solution (around 1e-4 RMSE) and is unable to improve the results any further.
A simple Flatten + Dense network with similar parameters is, however, able to achieve 1e-13 RMSE.
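For reference, that baseline could look roughly like this (a sketch, not necessarily the exact network used; it reuses the X_train and y_train defined above):

baseline = keras.models.Sequential()
baseline.add(keras.layers.Flatten(input_shape=(1, 1)))  # collapse the dummy time axis to a single feature
baseline.add(keras.layers.Dense(1))                      # a purely linear layer can represent y = x exactly
baseline.compile(loss=keras.losses.mean_squared_error, optimizer=keras.optimizers.SGD(lr=1e-1), metrics=['mae'])
baseline.fit(X_train, y_train, batch_size=16, epochs=100)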
I'm surprised the LSTM cell does not simply let the value through. Is there something I'm missing in my parameters? Is LSTM only good for classification problems?
I think there are some problems with your approach.
Firstly, looking at the Keras documentation, an LSTM expects an input of shape (batch_size, timesteps, input_dim). You're passing an input of shape (1000, 1, 1), which means your "sequences" are only 1 timestep long.
RNNs have been proposed to capture temporal dependencies, but it's impossible to capture such dependencies when each series has length 1 and the numbers are randomly generated. If you want to create a more realistic scenario, I would suggest generating a sine wave, since it has a smooth periodic oscillation. Then increase the number of timesteps beyond 1 and predict the following timesteps, as sketched below.
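A minimal sketch of such a setup, assuming an arbitrary window length and sampling (none of these values come from the original question):

import numpy as np
import keras

# Generate a sine wave and slice it into overlapping windows of `timesteps` points;
# the target for each window is the value that immediately follows it.
timesteps = 20
wave = np.sin(np.linspace(0, 100, 2000))
X = np.array([wave[i:i + timesteps] for i in range(len(wave) - timesteps)])
y = wave[timesteps:]
X = X.reshape((len(X), timesteps, 1))  # (batch_size, timesteps, input_dim)

model = keras.models.Sequential()
model.add(keras.layers.LSTM(32, input_shape=(timesteps, 1)))
model.add(keras.layers.Dense(1))
model.compile(loss=keras.losses.mean_squared_error, optimizer='adam')
model.fit(X, y, batch_size=32, epochs=20)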
For the second part, if you think about a simple RNN (I'll explain it for a simple RNN, but you can imagine a similar flow for an LSTM) and a Dense layer applied to a single timestep, there are not that many differences. The Dense layer computes $Y = f(XW + b)$, where $X$ is the input, $W$ is the weight matrix, $b$ is the bias and $f$ is the activation function. The RNN computes $Y = f(XW_1 + W_2 h_0 + b)$; since this is the first timestep, $h_0 = 0$, so it reduces to $Y = f(XW_1 + b)$, which is identical to the Dense layer. I suspect the difference in results is caused by the activation functions: by default the Dense layer has no activation function, while the LSTM uses tanh and sigmoid.
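One way to probe that hypothesis on the question's own setup (a sketch; note the gate sigmoids still apply, so this only removes the tanh squashing on the cell output):

model = keras.models.Sequential()
# Hypothetical variant: replace the default tanh cell activation with a linear one.
model.add(keras.layers.wrappers.Bidirectional(keras.layers.LSTM(1, activation='linear')))
model.add(keras.layers.Dense(1))
model.build(input_shape=(None, 1, 1))
model.compile(loss=keras.losses.mean_squared_error, optimizer=keras.optimizers.SGD(lr=1e-1), metrics=['mae'])
model.fit(X_train, y_train, batch_size=16, epochs=100)

If this variant gets much closer to the Dense baseline, the saturating activations were indeed the limiting factor.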
Answered by razvanc92 on December 13, 2020