How do anchor boxes in object detection work?

Question

I want to understand anchor boxes in more detail. However, after looking through the associated code and papers, I could not grasp the concept fully. I also looked at a lot of Quora questions, blog posts and papers that try to explain the concept, but none of them went into full detail (for dummies). I hope someone here is kind enough to take some time.
My current understanding is this:

We take an input image and compute successive feature maps from it until we arrive at a feature map of dimension width x height x channels. These dimensions are distinct from, and smaller than, the dimensions of the original input image.
We apply the regression and classification heads for the bounding boxes, and this is where the anchors come into play (I am not sure exactly how). The final loss function then scores the class predicted for each anchor box with a cross-entropy loss, and the predicted coordinate offsets relative to the anchor box with a regression loss, e.g. an L1 loss. For the loss calculation itself only a subset of anchor boxes is selected, normally those with a high IoU with a ground-truth box plus a random sample of background boxes.
Additional techniques such as focal loss can be applied to improve training. The anchor boxes can also be applied to feature maps at different depths of the network, which covers objects at different scales.
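To make the steps above concrete, here is a minimal pure-Python sketch of how anchors might be laid out on a feature map, matched to ground-truth boxes by IoU, and turned into regression targets. It is an illustration under assumptions, not the code of any particular detector: the function names (`make_anchors`, `assign`, `encode`, `smooth_l1`), the stride, the anchor size and ratios, and the 0.5 / 0.4 IoU thresholds are all chosen for this example, and the offsets use the common center / log-size parameterization.

```python
import math

def make_anchors(fm_w, fm_h, stride, sizes=(32.0,), ratios=(0.5, 1.0, 2.0)):
    """One anchor per (feature-map cell, size, ratio), as (x1, y1, x2, y2) in image pixels."""
    anchors = []
    for iy in range(fm_h):
        for ix in range(fm_w):
            # Center of this feature-map cell, projected back onto the image.
            cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
            for s in sizes:
                for r in ratios:  # r is width / height; area stays s * s
                    w, h = s * math.sqrt(r), s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Per anchor: index of the matched ground-truth box, -1 for background, None if ignored."""
    labels = []
    for a in anchors:
        ious = [iou(a, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda i: ious[i])
        if ious[best] >= pos_thr:
            labels.append(best)   # positive: contributes class and box loss
        elif ious[best] < neg_thr:
            labels.append(-1)     # background: contributes class loss only
        else:
            labels.append(None)   # in between: left out of the loss
    return labels

def encode(anchor, gt):
    """Regression target (tx, ty, tw, th): offsets of gt relative to the anchor."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + aw / 2, anchor[1] + ah / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + gw / 2, gt[1] + gh / 2
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four offsets of one anchor."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

# Usage: a 64x64 image, a 4x4 feature map (stride 16), one ground-truth box.
anchors = make_anchors(fm_w=4, fm_h=4, stride=16)
gt_boxes = [(8.0, 8.0, 40.0, 40.0)]
labels = assign(anchors, gt_boxes)
targets = [encode(a, gt_boxes[l]) if l is not None and l >= 0 else None
           for a, l in zip(anchors, labels)]
```

In this sketch the regression loss compares four numbers per positive anchor (predicted offsets against encoded targets), and the classification loss would compare one class score vector per anchor against its assigned label. The pixel values inside a box never enter the loss directly; they only influence it through the features the network computes.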

So much for the high-level concept. My question:

How exactly is the loss evaluated for those anchor boxes? Does the loss consider only the values inside the box, or the whole image?
