Data Science Asked on May 15, 2021
So in the paper on OCR / LaTeX formula extraction from images, What You Get Is What You See: A Visual Markup Decompiler, they pass the features of a CNN into an RNN encoder. But rather than passing the features directly, they propose arranging them into a grid first. The exact language they use is: "Extract the features from the CNN and then arrange those extracted features in a grid to pass into an RNN encoder."
What is meant by that? Theoretically speaking, if I have a CNN without any Dense/Fully-Connected layer that produces an output of shape [batch, m*n*C], how can I rearrange that into the form of a grid? Please see the picture below. After getting the output from the CNN, they have changed it somehow before passing it to the RNN. What method can one use to get this transformation?
So if I have to pass something to keras.layers.RNN()(that_desired_grid_format), what should this grid format be and how can I change my output into it?
It seems they use a shared RNN which processes each row sequentially, where each element of the sequence is the vector of channels of a single pixel, as described in the paper.
Let the output of the ConvNet be of size (batch_size, height, width, channels). The RNN expects an input of size (batch_size, sequence_length, input_size), so you have to reshape with the following correspondence:
batch_size*height -> batch_size
width -> sequence_length
channels -> input_size
Then process each row (along the height dimension) with the same RNN and concatenate the results.
To do that, we simply reshape to merge the batch and height axes into one dimension, so that our RNN processes each row independently.
# Keras' Reshape layer cannot touch the batch axis, so merge batch and height with tf.reshape
rnn_input = tf.reshape(convnet_output, (-1, width, channels))  # (batch_size*height, width, channels)
# keras.layers.RNN expects a cell, e.g. an LSTMCell
rnn_output = keras.layers.RNN(keras.layers.LSTMCell(hidden_dim), return_sequences=True)(rnn_input)
rnn_output will have shape (batch_size*height, width, hidden_dim). You can then combine this tensor into a context vector using a dense layer with tanh activation, as it is written in the paper.
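As an illustration of that step, here is a minimal sketch, assuming the context is built from the final timestep of each row (this aggregation and the name context are my assumptions, not the paper's exact construction):
# Project each row's final hidden state through a tanh dense layer (illustrative assumption)
context = keras.layers.Dense(hidden_dim, activation="tanh")(rnn_output[:, -1, :])  # (batch_size*height, hidden_dim)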
The paper also uses a trainable initial state for the RNN; you might be interested in this library to implement it.
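If you prefer not to add a dependency, a trainable initial state can also be sketched by hand. The layer name and the zero initialization below are assumptions, not the paper's implementation:
import tensorflow as tf
from tensorflow import keras

class RowEncoderWithTrainableState(keras.layers.Layer):
    """LSTM whose initial hidden and cell states are learned parameters."""
    def __init__(self, hidden_dim, **kwargs):
        super().__init__(**kwargs)
        self.lstm = keras.layers.LSTM(hidden_dim, return_sequences=True)
        self.h0 = self.add_weight(name="h0", shape=(1, hidden_dim), initializer="zeros", trainable=True)
        self.c0 = self.add_weight(name="c0", shape=(1, hidden_dim), initializer="zeros", trainable=True)

    def call(self, inputs):
        batch = tf.shape(inputs)[0]
        # Broadcast the learned states across the batch dimension
        initial_state = [tf.tile(self.h0, [batch, 1]), tf.tile(self.c0, [batch, 1])]
        return self.lstm(inputs, initial_state=initial_state)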
If you set your Conv2D layer with "channels_first", the output convnet_output will be of size (batch_size, channels, height, width). Therefore you first need to permute the dimensions before reshaping.
# Keras' Permute indexes only the non-batch axes, starting at 1
convnet_output = keras.layers.Permute((2, 3, 1))(convnet_output)
After this step, convnet_output has dimension (batch_size, height, width, channels). You can then proceed as previously, reshaping and feeding to the RNN.
# Merge batch and height again, then run the shared row encoder
rnn_input = tf.reshape(convnet_output, (-1, width, channels))  # (batch_size*height, width, channels)
rnn_output = keras.layers.RNN(keras.layers.LSTMCell(hidden_dim), return_sequences=True)(rnn_input)
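Putting it all together, here is a self-contained sketch of the row-encoding step (the toy sizes and hidden_dim are illustrative assumptions):
import tensorflow as tf
from tensorflow import keras

batch_size, height, width, channels, hidden_dim = 8, 16, 32, 64, 128

# Stand-in for the CNN feature map
convnet_output = tf.random.normal((batch_size, height, width, channels))

# Merge batch and height so the shared RNN encodes every row independently
rows = tf.reshape(convnet_output, (-1, width, channels))  # (batch*height, width, channels)
encoded = keras.layers.RNN(keras.layers.LSTMCell(hidden_dim), return_sequences=True)(rows)

# Restore the spatial layout: this is the encoded feature grid the question asks about
grid = tf.reshape(encoded, (batch_size, height, width, hidden_dim))
print(grid.shape)  # (8, 16, 32, 128)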
Correct answer by Adam Oudad on May 15, 2021