
Approach for extracting/cropping feature images using deep learning and no annotations

Data Science Asked by L Xandor on April 25, 2021

Let’s say I want to get a bunch of images of hats from videos. How would I, in principle, build something that would learn to recognize hats and crop or bounding-box them? I’ve heard you need a training dataset with manually drawn bounding boxes, but it seems there should be a way for a neural network to identify them on its own?

I’m trying to understand whether it’s possible to scrape video for different items. E.g., give it 1000 images of hats, and it will then crop out images of hats from a bunch of video files.

I am thinking this could be an interesting thing to work on, but I would need some advice on how to approach it.

Also, the next logical thing is then to put hats on people in movies somehow, but that would be phase 2.

Thanks

One Answer

So all you have is a set of images containing only cropped hats? One idea is to leverage synthetic data to learn an object detector. It would be better if you had crops of natural images containing hats, rather than hats alone, as that would lessen the domain shift.

Basically, the pipeline is the following:

(1) Take your hat images and generate "ground truth" training images by randomly pasting a random number of hats (in random poses) onto random background images. The backgrounds can come from anywhere, but ideally they should resemble the backgrounds of the images you plan to run on at test time. Because you constructed each image yourself, the true hat positions come for free (a minimal sketch of this step follows the list).

(2) Train an object detector (e.g. Mask R-CNN, Faster R-CNN, or YOLO9000) on this synthetic dataset.

(3) Given a video, decompose it into a sequence of image frames and run your trained detector frame by frame (see the inference sketch further below).

(4) Use the detector's output to obtain the hat positions and crop them out.
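As a concrete sketch of step (1), here is one way to do the pasting with Pillow. It assumes the hat crops are PNGs with an alpha channel; the folder names and the scale/rotation ranges are arbitrary illustrative choices:

import random
from pathlib import Path
from PIL import Image

HATS_DIR = Path("hats")                 # hypothetical folder of cropped hat images (RGBA)
BACKGROUNDS_DIR = Path("backgrounds")   # hypothetical folder of background images

def make_synthetic_sample(hat_paths, bg_path, max_hats=3):
    """Paste 1..max_hats random hats onto a background; return the image
    and the ground-truth boxes as (x_min, y_min, x_max, y_max)."""
    bg = Image.open(bg_path).convert("RGB")
    boxes = []
    for hat_path in random.sample(hat_paths, random.randint(1, max_hats)):
        hat = Image.open(hat_path).convert("RGBA")
        # Random scale and rotation as a crude stand-in for "random poses".
        w = int(bg.width * random.uniform(0.1, 0.3))
        h = max(1, int(hat.height * w / hat.width))
        hat = hat.resize((w, h)).rotate(random.uniform(-30, 30), expand=True)
        x = random.randint(0, max(0, bg.width - hat.width))
        y = random.randint(0, max(0, bg.height - hat.height))
        bg.paste(hat, (x, y), mask=hat)  # use the alpha channel as the paste mask
        boxes.append((x, y, x + hat.width, y + hat.height))
    return bg, boxes

hat_files = list(HATS_DIR.glob("*.png"))
bg_files = list(BACKGROUNDS_DIR.glob("*.jpg"))
image, boxes = make_synthetic_sample(hat_files, random.choice(bg_files))
image.save("synthetic_000.png")  # and store `boxes` in your detector's annotation format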

A simple extension would be to generate synthetic videos for training, where your hats move around and you keep track of them; another would be to make the synthetic images more realistic, by, say, attaching random people (or at least heads) underneath the hats.
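For steps (3) and (4), the frame-by-frame loop might look like the following sketch. The torchvision Faster R-CNN, the checkpoint name, the input file, and the 0.8 score threshold are assumptions standing in for whatever detector and data you actually use:

import os
import cv2
import torch
import torchvision

# Hypothetical detector: a torchvision Faster R-CNN with 2 classes
# (background + hat), loaded from weights trained on the synthetic set.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.load_state_dict(torch.load("hat_detector.pth"))  # hypothetical checkpoint
model.eval()

os.makedirs("crops", exist_ok=True)
cap = cv2.VideoCapture("input_video.mp4")  # hypothetical input video
frame_idx = crop_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    # OpenCV yields BGR uint8 HxWx3; the detector wants RGB float CHW in [0, 1].
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([tensor])[0]
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score > 0.8:  # illustrative confidence threshold
            x1, y1, x2, y2 = map(int, box.tolist())
            cv2.imwrite(f"crops/hat_{frame_idx:06d}_{crop_idx}.png",
                        frame[y1:y2, x1:x2])
            crop_idx += 1
    frame_idx += 1
cap.release()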

The main challenge is the domain adaptation between synthetic and real images. Any of the many modern techniques for handling domain shift would likely be helpful, for instance GAN-based approaches.

It may interest you to know that this simple synthetic methodology has proven useful in practice. For instance, Tremblay et al. at NVIDIA recently used it to train an object detector via domain randomization. Other applications include robotics.


A cruder, but easier to implement, solution is to train a hat classifier $C$, which takes in an image or image patch and outputs whether the patch contains a hat or not. Using your cropped hat dataset as the positives, and random crops from any other image dataset as the negatives, you can train $C$. Given a video, you then simply use a sliding-window approach: for every image patch of every frame, run $C$ on it, threshold its output, and crop out the regions where the classifier scores highly.
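A minimal sketch of that sliding window, assuming $C$ is a binary PyTorch classifier mapping a (1, 3, patch, patch) tensor to a single logit; the patch size, stride, and threshold here are illustrative:

import torch

def detect_hats(frame, classifier, patch=64, stride=32, threshold=0.9):
    """Slide a patch x patch window over a (3, H, W) float tensor and
    return (x_min, y_min, x_max, y_max) boxes scoring above threshold."""
    _, H, W = frame.shape
    boxes = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            window = frame[:, y:y + patch, x:x + patch].unsqueeze(0)
            with torch.no_grad():
                score = torch.sigmoid(classifier(window)).item()  # P(hat)
            if score > threshold:
                boxes.append((x, y, x + patch, y + patch))
    return boxes

In practice you would run this at multiple scales and apply non-maximum suppression to merge the overlapping windows that fire on the same hat.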

Answered by user3658307 on April 25, 2021
