"Change the features of a CNN into a grid to fed into RNN Encoder?" What is meant by that?

Asked on Data Science, May 15, 2021

In the paper on OCR for LaTeX formula extraction from images, What You Get Is What You See: A Visual Markup Decompiler, the authors pass the features of a CNN into an RNN encoder. However, rather than passing the features directly, they propose first arranging them into a grid.

"Extract the features from the CNN and then arrange those extracted features in a grid to pass into an RNN encoder." This is the exact language they use.

What is meant by that? Theoretically speaking, if I have a CNN without any Dense/Fully Connected layers that produces an output of shape [batch, m*n*C], how can I change it into the form of a grid? Please see the picture below: after getting the output from the CNN, they somehow transform it before passing it to the RNN. What method can one use to get this transformation?

[Figure: the paper's pipeline diagram, showing the CNN output rearranged before the RNN encoder]

So if I have to pass something to keras.layers.RNN()(that_desired_grid_format), what should this grid format be, and how can I produce it?

One Answer

It seems they use a shared RNN that processes each row of the feature map sequentially, taking the channel vector at each pixel as one step of the input sequence. From the paper:

[Excerpt from the paper describing the row encoder]

Implementation with channels last

Let the output of the ConvNet be of size (batch_size, height, width, channels). The RNN expects an input of size (batch_size, sequence_length, input_size). So you have to reshape it with the following correspondence.

batch_size*height -> batch_size
channels -> input_size
width -> sequence_length

Each row (along the height dimension) is then processed by the same RNN, and the results are concatenated.

To do that, we simply reshape to merge the batch and height axes into one dimension, so that the RNN processes each row independently.

import tensorflow as tf  # keras.layers.Reshape cannot merge the batch axis, so use tf.reshape
rnn_input = tf.reshape(convnet_output, (batch_size * height, width, channels))
# GRU replaces keras.layers.RNN(hidden_dim): RNN expects a cell object, not a size
rnn_output = keras.layers.GRU(hidden_dim, return_sequences=True)(rnn_input)

rnn_output will have shape (batch_size*height, width, hidden_dim). You can then combine this tensor into a context vector using a dense layer with a tanh activation, as described in the paper.
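As a minimal sketch (reusing the shape variables from above; the exact projection is an assumption on my part, not the paper's code), that could look like:

projected = keras.layers.Dense(hidden_dim, activation="tanh")(rnn_output)  # tanh projection per position
grid = tf.reshape(projected, (batch_size, height, width, hidden_dim))      # restore the 2-D grid layout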

The paper also uses a trainable initial state for the RNN, which you might want a helper library or a small custom layer to implement.
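A minimal sketch of one way to implement this, assuming a GRU row encoder (the layer below is illustrative, not the paper's code):

import tensorflow as tf
from tensorflow import keras

class RowEncoderWithTrainableInit(keras.layers.Layer):
    """Sketch: GRU whose initial hidden state is a learned parameter."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.gru = keras.layers.GRU(units, return_sequences=True)

    def build(self, input_shape):
        # One learned state vector, tiled across the batch at call time.
        self.init_state = self.add_weight(
            name="init_state", shape=(1, self.units),
            initializer="zeros", trainable=True)

    def call(self, inputs):
        batch = tf.shape(inputs)[0]
        state = tf.tile(self.init_state, (batch, 1))
        return self.gru(inputs, initial_state=state)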

Implementation with channels first

If you set your Conv2D layer to data_format="channels_first", the output convnet_output will be of size (batch_size, channels, height, width). Therefore you first need to permute the dimensions before reshaping.

# Permute's dims exclude the batch axis and are 1-indexed, so (2, 3, 1), not (0, 2, 3, 1)
convnet_output = keras.layers.Permute((2, 3, 1))(convnet_output)

After this step, convnet_output has shape (batch_size, height, width, channels). You can then proceed as before, reshaping and feeding into the RNN.

rnn_input = tf.reshape(convnet_output, (batch_size * height, width, channels))  # merge batch and height
rnn_output = keras.layers.GRU(hidden_dim, return_sequences=True)(rnn_input)
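Putting it together, here is a self-contained sketch with dummy shapes (all sizes and the GRU choice are illustrative assumptions):

import tensorflow as tf
from tensorflow import keras

# Dummy ConvNet output, channels-last: (batch, height, width, channels)
batch_size, height, width, channels, hidden_dim = 4, 8, 32, 64, 128
convnet_output = tf.random.normal((batch_size, height, width, channels))

# Fold height into the batch axis: every row becomes an independent sequence
rows = tf.reshape(convnet_output, (batch_size * height, width, channels))

# One shared RNN encodes all rows (the same weights are reused across rows)
encoded_rows = keras.layers.GRU(hidden_dim, return_sequences=True)(rows)

# Restore the grid layout: (batch, height, width, hidden_dim)
grid = tf.reshape(encoded_rows, (batch_size, height, width, hidden_dim))
print(grid.shape)  # (4, 8, 32, 128)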

Correct answer by Adam Oudad on May 15, 2021
