Data Science Asked by stv on November 24, 2020
I have an array of sequences of equal length, each sequence contains 300 numbers (M=300). Each element in a sequence is a number from 1 to 9:
13571398...2455 # 300 numbers
33344467...1143 # 300 numbers
...
...
...
66118859...2121 # 300 numbers
My task is to build a model that predicts the elements (numbers) at positions 180 to 190 of a sequence, based on the first 180 elements and the last 109 elements of that sequence.
In other words, given the elements at positions 0 to 179 and at positions 191 to 299, predict the 11 elements at positions 180 to 190.
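To make the setup concrete, here is a minimal NumPy sketch (on random toy data standing in for the real array) of how each sequence splits into the known context and the 11 targets to predict:

```python
import numpy as np

# Toy batch: 5 sequences of length 300, digits 1..9 (stand-in for the real data)
rng = np.random.default_rng(0)
data = rng.integers(1, 10, size=(5, 300))

# Known context: positions 0..179 (180 values) and 191..299 (109 values)
context = np.concatenate([data[:, :180], data[:, 191:]], axis=1)
# Targets to predict: positions 180..190 (11 values)
target = data[:, 180:191]

print(context.shape, target.shape)  # (5, 289) (5, 11)
```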
I am thinking about solving this task with a Keras Bi-LSTM model.
Any other ideas, in particular using other models such as Transformers (PyTorch, TensorFlow), are very welcome, thanks!
The framing of your problem is close to the so-called language modeling task — more precisely, masked language modeling, where hidden tokens must be reconstructed from their surroundings. Because your input data consists of fixed-length samples, you can use a seq2seq model with a fixed-size context embedding.
Concretely, you would have an encoder, for example a Bi-LSTM, which encodes your input into a fixed representation (by concatenating the final states of the forward and backward LSTMs), and a decoder, for example an LSTM, which produces the output tokens sequentially.
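A minimal Keras sketch of this encoder-decoder idea might look as follows. The layer sizes (embedding dim 32, 64 LSTM units) are illustrative choices, not tuned values; the fixed encoder state is simply repeated once per output position via `RepeatVector`, which is one common way to decode a fixed-length span:

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB = 10      # digits 1..9, with index 0 left free
CTX_LEN = 289   # 180 known prefix positions + 109 known suffix positions
OUT_LEN = 11    # masked positions 180..190

inputs = layers.Input(shape=(CTX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB, 32)(inputs)
# Bi-LSTM encoder: the concatenated final states form a fixed-size context vector
x = layers.Bidirectional(layers.LSTM(64))(x)
# Repeat the context vector once per output position, then decode with an LSTM
x = layers.RepeatVector(OUT_LEN)(x)
x = layers.LSTM(64, return_sequences=True)(x)
# Per-position distribution over the vocabulary
outputs = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

dummy = np.random.randint(1, 10, size=(2, CTX_LEN))
preds = model.predict(dummy, verbose=0)
print(preds.shape)  # (2, 11, 10)
```

Training with integer targets of shape `(batch, 11)` then works directly via `model.fit(context, target)`.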
Your objective function could be the mean of the cross-entropy loss over each output token, or a more complex loss such as CTC. You can also simplify the task by predicting only the masked tokens, instead of the whole sequence, as the output of your neural network.
The fact that your tokens are integers makes no difference and actually simplifies the embedding: you can feed the data as-is to an embedding layer in Keras or PyTorch. If you use PyTorch, there is this tutorial that I would recommend, which uses a Transformer instead of an LSTM.
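Feeding the integer tokens directly to an embedding layer can be sketched in PyTorch like this (an embedding dimension of 32 is an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn

# Digits 1..9 can index the embedding table directly; size 10 leaves index 0 free
embed = nn.Embedding(num_embeddings=10, embedding_dim=32)

batch = torch.randint(1, 10, (4, 289))  # 4 sequences of 289 context tokens
vectors = embed(batch)
print(vectors.shape)  # torch.Size([4, 289, 32])
```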
Correct answer by Adam Oudad on November 24, 2020