Classifying videos with varying length using ConvLSTM2D in tensorflow

Data Science Asked by danielsvane on June 29, 2021

I have a collection of videos, where I would like to extract a frame for every second, and then feed them through a ConvLSTM2D for binary classification.

(Image: example network for binary video classification)
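Roughly, I have something like the following in mind (the 64x64 RGB frame size and layer sizes are just placeholders, not my actual values):

import tensorflow as tf

# the frames axis is None so the model accepts videos of varying length
model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(None, 64, 64, 3)),  # (frames, height, width, channels)
  tf.keras.layers.ConvLSTM2D(16, kernel_size=3),   # returns the last hidden state: (rows, cols, filters)
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])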

I was under the impression that an LSTM could take inputs of varying length, but after many hours of googling it seems like I need to either:

  • Use padding and masking
  • Use ragged tensors
  • Actually use varying input length, but use batch size of 1

I'm not sure how to proceed from here, since I can't find any resources on padding and masking a sequence of images. Ragged tensors are confusing, and I can't find any examples of them used with a sequence of images either. When trying to use a batch size of 1, tensorflow still complains that the inputs are not the same size when calling model.fit.
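For the padding option, the closest I can get is zero-padding the image stacks to a common length with plain numpy; whether ConvLSTM2D then actually respects a Masking layer is exactly the part I'm unsure about, so this sketch only covers the padding half:

import numpy as np

def pad_image_stacks(stacks, value=0.0):
  # stacks: list of arrays shaped (frames_i, height, width, channels), where frames_i varies
  max_frames = max(s.shape[0] for s in stacks)
  padded = np.full((len(stacks), max_frames) + stacks[0].shape[1:], value, dtype=np.float32)
  for i, s in enumerate(stacks):
    padded[i, :s.shape[0]] = s  # zero-pad at the end of each video
  return padded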

The length of the video is actually important, so that's the reason I'm using a variable number of images, but I could possibly extract a fixed number of frames.
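If I did go with a fixed number of frames, I assume evenly spaced sampling would look roughly like this (16 frames is an arbitrary choice):

import numpy as np

def sample_frame_indices(total_frames, target_frames=16):
  # evenly spaced indices, e.g. total_frames=82 and target_frames=16 spreads 16 picks over the whole video
  return np.round(np.linspace(0, total_frames - 1, num=target_frames)).astype(int)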

Any code examples or suggestions are appreciated.

One Answer

I found a solution that works for me.

Since I wanted to avoid padding and masking, and didn't fully understand ragged tensors, I decided to continue using varying input lengths.

My training data consists of a list of image stacks, each between 6 and 82 frames long. When trying to use this directly with model.fit(x=x, y=y, batch_size=1), where x is a list of tensors, tensorflow complains that the input dimensions have varying sizes. I thought this wouldn't matter since the batch size is 1, but apparently tensorflow still tries to convert the list of image stacks into a single tensor.

A way around this is to pass one training sample per step using a generator:

class ArtificialDataset(tf.data.Dataset):
  def _generator(samples):
    # each training sample is an (image_stack, output) pair, yielded one at a time
    for image_stack, output in samples:
      yield (image_stack, output)

And then making sure that steps_per_epoch matches the size of the data set:

training_data = ArtificialDataset(training_samples)
model.fit(training_data, epochs=epochs, steps_per_epoch=len(training_samples))

This way tensorflow never tries to create a tensor with varying inner dimensions. A drawback of this method is that no batching occurs, so training takes a while. In my case this doesn't matter much since the network size and input data already requires me to load very few samples at a time.
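For completeness, the snippet above glosses over how the generator gets wired into a tf.data.Dataset. A minimal self-contained sketch of that wiring could look like the following, where the frame size, dtypes and output_signature are placeholder assumptions rather than my actual values:

import tensorflow as tf

class ArtificialDataset(tf.data.Dataset):
  def _generator(samples):
    for image_stack, output in samples:
      yield (image_stack, output)

  def __new__(cls, samples):
    ds = tf.data.Dataset.from_generator(
      lambda: cls._generator(samples),
      output_signature=(
        tf.TensorSpec(shape=(None, 64, 64, 3), dtype=tf.float32),  # frame count varies per video
        tf.TensorSpec(shape=(), dtype=tf.float32)))
    # batches of one sample, so the varying frame counts never have to be stacked together
    return ds.batch(1)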

An optimization would be to batch same-length videos (or videos with same number of extracted frames). I might do this at a later time, since the implementation of varying batch sizes is too much work right now.
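If I do get around to it, one candidate for that batching is tf.data's group_by_window transformation, keyed on the frame count. A rough sketch, assuming an unbatched dataset of (image_stack, label) pairs called frame_dataset and an arbitrary batch size of 4:

import tensorflow as tf

# frame_dataset is assumed to yield (image_stack, label) pairs without a batch dimension
same_length_batches = frame_dataset.apply(
  tf.data.experimental.group_by_window(
    key_func=lambda image_stack, label: tf.cast(tf.shape(image_stack)[0], tf.int64),
    reduce_func=lambda key, window: window.batch(4),
    window_size=4))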

Correct answer by danielsvane on June 29, 2021
