
Implementing training in PyTorch

Data Science Asked by Sannidhi on October 28, 2020

I wish to accomplish the following task in PyTorch:

I have the COCO dataset, wherein each data sample is used in training YOLO v3. After being processed by the model, the sample is to be deleted if it satisfies a certain condition. The data sample is thus no longer used for training in further epochs.

I now have two questions regarding implementation –

1) How do I process each sample individually? Do I go about this by setting batch size = 1? Or is there any advantage to disabling automatic batching? If so, how do I go about this?

2) How exactly do I delete the sample from the dataset for further epochs? Is there any way to skip this sample in the DataLoader?

2 Answers

1) How do I process each sample individually? Do I go about this by setting batch size = 1? Or is there any advantage to disabling automatic batching? If so, how do I go about this?

If you set batch size to 1, then you are effectively disabling batching. Samples will be processed one at a time and gradients will be computed for single samples. This is not necessarily a problem - just keep in mind that you'll lose the advantages of mini-batching, namely smoother gradient estimates and better hardware utilization.

If you are using PyTorch's DataLoader, then just set batch_size and batch_sampler to None. According to the docs, this will disable automatic batching.
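For illustration, here is a minimal sketch of both options using a toy in-memory dataset as a stand-in for COCO (the names here are assumptions for the example, not anything from the question):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the COCO dataset: 10 fake images with dummy targets.
toy_data = TensorDataset(torch.randn(10, 3, 32, 32), torch.zeros(10))

# Option A: keep automatic batching but use batches of one sample.
# Each item arrives with a leading batch dimension of size 1.
loader_bs1 = DataLoader(toy_data, batch_size=1, shuffle=True)

# Option B: disable automatic batching entirely (batch_size=None together
# with the default batch_sampler=None). Samples are returned exactly as the
# dataset yields them, with no extra batch dimension and no default collation.
loader_unbatched = DataLoader(toy_data, batch_size=None, shuffle=True)

image, target = next(iter(loader_unbatched))
print(image.shape)  # torch.Size([3, 32, 32]) - no batch dimension
```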

2) How exactly do I delete the sample from the dataset for further epochs? Is there any way to skip this sample in the DataLoader?

You can implement a custom PyTorch dataset. Internally, keep track of which samples have been deleted. If you're using an iterable-style dataset, then the next step is pretty easy. Just skip over the deleted samples in __iter__().
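A minimal sketch of that idea (the class name and the in-memory samples list are assumptions made for illustration):

```python
from torch.utils.data import IterableDataset

class SkippingIterableDataset(IterableDataset):
    """Iterable-style dataset that skips samples flagged as deleted."""

    def __init__(self, samples):
        self.samples = samples   # e.g. a list of (image, target) pairs
        self.deleted = set()     # positions of samples removed from training

    def mark_deleted(self, position):
        # Call this from the training loop when the deletion condition holds.
        self.deleted.add(position)

    def __iter__(self):
        for position, sample in enumerate(self.samples):
            if position in self.deleted:
                continue         # deleted samples never reach later epochs
            yield sample
```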

Map-style datasets will be a little trickier. Your __len__() method will need to return the original length minus the number of deleted samples, and you will have to find some indexing scheme that prevents deleted samples from being accessed. One naive idea is simply to reindex after every deletion. The obvious problem is that the DataLoader's random sampler won't have any way to know this, so a single sample could be processed multiple times in one epoch. Another idea is to index your samples by some unique key rather than by integers.
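Here is a rough sketch of the key-based idea for a map-style dataset (the names are hypothetical, and a real version would load COCO images lazily rather than hold everything in memory):

```python
from torch.utils.data import Dataset

class PrunableDataset(Dataset):
    """Map-style dataset that hides deleted samples behind a list of live keys."""

    def __init__(self, samples):
        # `samples` maps a unique key (e.g. a COCO image id) to (image, target).
        self.samples = samples
        self.live_keys = list(samples.keys())

    def delete(self, key):
        # Remove the key so the sample can never be indexed again.
        self.live_keys.remove(key)

    def __len__(self):
        return len(self.live_keys)

    def __getitem__(self, index):
        key = self.live_keys[index]
        image, target = self.samples[key]
        # Returning the key lets the training loop decide what to delete.
        return image, target, key
```

To stay on the safe side with the DataLoader's sampler, only call delete() between epochs and recreate the DataLoader (or at least its sampler) before each epoch, so the sampler's view of the dataset length matches the pruned dataset.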

Answered by zachdj on October 28, 2020

2) I would try to use this:

https://pytorch.org/docs/stable/data.html#torch.utils.data.SubsetRandomSampler

It is a sampler that restricts the DataLoader to a given set of indices. Wrap it in a BatchSampler if you need batches. Also, modify the dataset so that it returns (image, target, index) instead of the usual (image, target); see the sketch below.
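A rough sketch of this approach (IndexedDataset, should_delete and the loop sizes are placeholders, not anything from the question):

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler

class IndexedDataset(Dataset):
    """Toy dataset that also returns the sample index, COCO-style data assumed."""

    def __init__(self, n=16):
        self.images = torch.randn(n, 3, 32, 32)
        self.targets = torch.zeros(n)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # The index travels with the sample so the training loop knows
        # which entry to delete.
        return self.images[index], self.targets[index], index


def should_delete(image, target):
    # Placeholder for the real per-sample deletion condition.
    return bool(torch.rand(1) < 0.1)


dataset = IndexedDataset()
keep_indices = set(range(len(dataset)))

for epoch in range(3):
    # Rebuild the sampler each epoch so deletions from the previous
    # epoch take effect.
    sampler = SubsetRandomSampler(list(keep_indices))
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    for images, targets, indices in loader:
        for i, idx in enumerate(indices.tolist()):
            if should_delete(images[i], targets[i]):
                keep_indices.discard(idx)
```

Rebuilding the SubsetRandomSampler each epoch is what makes the deletions stick: the sampler only ever draws from the index list it was given at construction time.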

1) The DataLoader's collate_fn handles that: the loaded batch is passed to collate_fn to be preprocessed. It was hard for me to understand this at the beginning.

https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
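For example, a custom collate_fn along these lines can batch fixed-size images while leaving variable-length detection targets as a list (a sketch, assuming each dataset item is (image, target, index) as suggested above):

```python
import torch
from torch.utils.data import DataLoader

def detection_collate(batch):
    images, targets, indices = zip(*batch)
    images = torch.stack(images, dim=0)        # images share a fixed size
    # Targets stay in a plain list because each image can have a different
    # number of bounding boxes, so they cannot be stacked into one tensor.
    return images, list(targets), torch.tensor(indices)

# loader = DataLoader(dataset, batch_size=8, collate_fn=detection_collate)
```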

Can you tell me whether this works, or what your final solution is? Good luck!

Answered by user97950 on October 28, 2020
