Asked by Tonnz on June 11, 2021
I want to train an LSTM model with variable-length inputs. Specifically, I want to use as little padding as possible while still using minibatches.
As far as I understand, each batch requires a fixed number of timesteps for all inputs, which necessitates padding. But different batches can have different numbers of timesteps, so within each batch the inputs only have to be padded to the length of the longest sequence in that same batch. This is what I want to implement.
Sadly, my googling skills have failed me entirely. I can only find examples and resources on how to pad the entire input set to a fixed length, which is what I had been doing already and want to move away from. Some clues point me towards TensorFlow's Dataset API, yet I can't find examples of how and why it would apply to the problem I am facing.
I’d appreciate any pointers to resources and ideally examples and tutorials on what I am trying to accomplish.
The answer to your needs is called "bucketing". It consists of creating batches of sequences of similar length, to minimize the amount of padding needed.
In TensorFlow, you can do it with tf.data.experimental.bucket_by_sequence_length. Take into account that it previously lived in a different Python package (tf.contrib.data.bucket_by_sequence_length), so examples online may contain the outdated name.
To see some usage examples, you can check this Jupyter notebook, other answers on Stack Overflow, or this tutorial.
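A rough sketch of how the usage might look (the sequences, bucket boundaries, and batch sizes below are made-up placeholders, and model is assumed to be an already-compiled Keras model):

import tensorflow as tf

# Toy ragged data: tokenized sequences of varying length plus binary labels
# (placeholders; substitute your own tokenized inputs and labels).
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
labels = [0, 1, 1, 0]

def gen():
    for x, y in zip(sequences, labels):
        yield x, y

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Group sequences of similar length into the same batch; each batch is padded
# only to the longest sequence it contains. The boundaries and batch sizes are
# arbitrary here; note there must be one more batch size than boundaries.
# Padding defaults to 0; pass padding_values to use a different pad token.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda x, y: tf.shape(x)[0],
        bucket_boundaries=[8, 16, 32],
        bucket_batch_sizes=[64, 64, 64, 64],
    )
)

model.fit(dataset, epochs=10)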
Correct answer by noe on June 11, 2021
Found a solution, which is to pass a custom batch generator (a subclass of keras.utils.Sequence) to the model.fit function, where one can write any logic to construct batches and to modify/augment the training data, instead of passing the entire dataset in one go. Relevant code for reference:
import numpy as np
from tensorflow import keras

# Must implement the __len__ function, returning the number
# of batches in this dataset, and the __getitem__ function,
# which returns a tuple (inputs, labels) for one batch.
# Optionally, on_epoch_end() can be implemented, which, as the
# name suggests, is called at the end of each epoch. Here one
# can e.g. shuffle the input data for the next epoch.
class BatchGenerator(keras.utils.Sequence):

    def __init__(self, inputs, labels, padding, batch_size):
        self.inputs = inputs            # list of tokenized sequences
        self.labels = labels
        self.padding = padding          # token ID used for padding
        self.batch_size = batch_size

    def __len__(self):
        # Number of full batches per epoch
        return int(np.floor(len(self.inputs) / self.batch_size))

    def __getitem__(self, index):
        start_index = index * self.batch_size
        end_index = start_index + self.batch_size

        # Pad only up to the longest sequence in this batch
        max_length = 0
        for i in range(start_index, end_index):
            max_length = max(max_length, len(self.inputs[i]))

        out_x = np.empty([self.batch_size, max_length], dtype='int32')
        out_y = np.empty([self.batch_size, 1], dtype='float32')
        for i in range(self.batch_size):
            out_y[i] = self.labels[start_index + i]
            tweet = self.inputs[start_index + i]
            l = len(tweet)
            for j in range(l):
                out_x[i][j] = tweet[j]
            for j in range(l, max_length):
                out_x[i][j] = self.padding
        return out_x, out_y
# The model.fit function can then be called like this:
training_generator = BatchGenerator(tokens_train, y_train, pad, batch_size)
model.fit(training_generator, epochs=epochs)
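One way to push the padding down further, as a sketch under the assumption that tokens_train is a plain Python list of token-ID lists and y_train a matching list of labels, is to sort the training data by sequence length before constructing the generator, so that each batch contains similarly sized sequences:

order = sorted(range(len(tokens_train)), key=lambda i: len(tokens_train[i]))
tokens_train = [tokens_train[i] for i in order]  # shortest sequences first
y_train = [y_train[i] for i in order]            # keep labels aligned

The trade-off is that batch composition becomes fixed, so any shuffling would then have to happen at the batch level, e.g. in on_epoch_end().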
Answered by Tonnz on June 11, 2021