
What's an appropriate datastore for variable-length sequence data for PyTorch consumption?

Data Science Asked on April 4, 2021

I have a large number of sequences – potentially hundreds of thousands – each consisting of between 100 and 10,000 items, each of which consists of about 5 floats.

I need a datastore that can rapidly serve these up in batches for PyTorch training. I also need to be able to rapidly write new sequences to the store. It’s like an experience replay buffer for reinforcement learning, but I want to store every single run.

These sequences should each have some attached structured metadata in a queryable format so that I can select subsets of sequences.

The best solution looks like HDF5 – either through h5py or PyTables – except that I don't know how to make it handle the variable sequence lengths efficiently. Padding isn't appropriate because the lengths vary so wildly that most of the padded space would be wasted, and storing each sequence as its own HDF5 dataset seems like a poor idea, since HDF5 doesn't appear to be optimised for huge numbers of small datasets.

Ideas on my radar include Pandas multi-indexing, HDF5 region references, and building a custom metadata index system from scratch. I’m not really sure where to go from here.

Storage compactness matters – I need to be reasonably efficient with my storage space.

One Answer

What I've opted for at the moment is packing all of the samples into a single HDF5 table buffer, and keeping a separate table with metadata that tracks each individual sequence's buffer position and length. This works, but I won't be marking this answer as correct because I'm not satisfied with it. This storage method is very poorly suited to editing, and it's vulnerable to loss if a bug were to cause the tables to become out of sync.
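A minimal sketch of this layout with h5py might look like the following. The dataset names ("buffer", "index"), the field names ("start", "length"), the chunk and compression settings, and the SequenceDataset wrapper are all illustrative assumptions rather than details from the original answer; the same structure could be built with PyTables instead.

```python
# Sketch of the flat-buffer + index-table layout described above, using h5py.
# All names and chunk/compression settings here are illustrative assumptions.
import h5py
import numpy as np
import torch

FEATURE_DIM = 5  # each item in a sequence is ~5 floats
INDEX_DTYPE = np.dtype([("start", "int64"), ("length", "int64")])


def create_store(path):
    """Create an extendable flat sample buffer plus a per-sequence index table."""
    f = h5py.File(path, "w")
    f.create_dataset(
        "buffer",
        shape=(0, FEATURE_DIM),
        maxshape=(None, FEATURE_DIM),
        dtype="float32",
        chunks=(4096, FEATURE_DIM),
        compression="gzip",
    )
    f.create_dataset(
        "index", shape=(0,), maxshape=(None,), dtype=INDEX_DTYPE, chunks=(1024,)
    )
    return f


def append_sequence(f, seq):
    """Append one (length, FEATURE_DIM) sequence and record its position."""
    seq = np.asarray(seq, dtype="float32")
    buf, idx = f["buffer"], f["index"]
    start = buf.shape[0]
    buf.resize(start + len(seq), axis=0)
    buf[start:] = seq
    n = idx.shape[0]
    idx.resize(n + 1, axis=0)
    idx[n] = np.array((start, len(seq)), dtype=INDEX_DTYPE)


class SequenceDataset(torch.utils.data.Dataset):
    """Serve individual variable-length sequences as float32 tensors."""

    def __init__(self, path):
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.num_sequences = f["index"].shape[0]

    def __len__(self):
        return self.num_sequences

    def __getitem__(self, i):
        # Open lazily so each DataLoader worker gets its own file handle.
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        rec = self.file["index"][i]
        start, length = int(rec["start"]), int(rec["length"])
        return torch.from_numpy(self.file["buffer"][start:start + length])
```

A DataLoader over this dataset would still need a collate_fn that copes with varying lengths (for example torch.nn.utils.rnn.pad_sequence or packing), since the sequences come back at their natural sizes.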

Answered by Sam on April 4, 2021
