Data Science Asked by Alexandru Dinu on July 13, 2021
I want to train YOLOv3 for a custom dataset that has raw labels in JSON format. Each bounding box in JSON is specified as [x1, y1, x2, y2].

So far, I have converted [x1, y1, x2, y2] to [cx, cy, pw, ph], that is: center x and center y of the bounding box, scaled to the image width and height; pw and ph are the ratios of the bounding box's width and height relative to the image's width and height. But I don't think that's complete (or even right).
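For reference, a minimal sketch of that conversion (the function name and the assumption that (x1, y1) is the top-left and (x2, y2) the bottom-right corner, in pixels, are mine):

def xyxy_to_cxcywh(x1, y1, x2, y2, img_w, img_h):
    # Convert absolute corner coordinates to the normalized
    # [cx, cy, pw, ph] format described above.
    cx = (x1 + x2) / 2.0 / img_w   # box center x, in [0, 1]
    cy = (y1 + y2) / 2.0 / img_h   # box center y, in [0, 1]
    pw = (x2 - x1) / img_w         # box width / image width
    ph = (y2 - y1) / img_h         # box height / image height
    return cx, cy, pw, ph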
As far as I understand, YOLOv3 assigns N anchor boxes to each grid cell (the image is divided into an S x S grid of cells), and thus the prediction of a bounding box is relative to a given anchor box of a grid cell (the one anchor box that has the highest IoU with the ground truth). The formulas are below (from the YOLOv3 paper; note that there c_x, c_y denote the grid cell's top-left corner and p_w, p_h the anchor prior's width and height):

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)
Therefore, how should I prepare the ground truths so that YOLOv3 can understand them? Do I have to, somehow, reverse those formulas? Also, how do I account for a different number of scales and a different number of anchor boxes?
For a concrete example: suppose I have a 416 x 416 image and a configuration of 13 x 13 grid cells. The ground-truth bounding box (from the dataset) is [x1=100, y1=100, x2=200, y2=200], class = c. What will be the converted values for YOLOv3?
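Working that example through the conversion above (my own arithmetic; this covers only the normalization step, the anchor encoding is sketched further below):

x1, y1, x2, y2 = 100, 100, 200, 200
cx = (x1 + x2) / 2 / 416   # = 0.3606
cy = (y1 + y2) / 2 / 416   # = 0.3606
w  = (x2 - x1) / 416       # = 0.2404
h  = (y2 - y1) / 416       # = 0.2404
col, row = int(cx * 13), int(cy * 13)   # 0.3606 * 13 = 4.6875 -> cell (4, 4)
# the box center falls in grid cell (4, 4), at offset (0.6875, 0.6875) inside it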
Later edit: say we have 2 classes, [car, person], and 2 anchors (1 wide, 1 tall). Would the output be a tensor of shape 13 x 13 x (2*(5+2)), where most of the values (that vector of shape 2*(5+2)) for the grid cells are 0, except for one particular cell (the one in which the center of the ground-truth bounding box falls)?
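As a tiny sanity check of that shape under exactly those assumptions (the array layout is my own choice):

import numpy as np

S, num_anchors, num_classes = 13, 2, 2
# one [tx, ty, tw, th, obj, p_car, p_person] slot per anchor and cell
target = np.zeros((S, S, num_anchors * (5 + num_classes)), dtype=np.float32)
print(target.shape)   # (13, 13, 14), i.e. 13 x 13 x (2 * (5 + 2)), all zeros

Only the cell responsible for the ground truth then gets nonzero values, as discussed next.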
In this case, for that cell (say c[i,j]), suppose the largest IoU is given by anchor 2 and that the ground-truth class is person. This means that c[i,j,:7] (anchor 1's prediction) will be ignored and c[i,j,7:] (anchor 2's prediction) will be [bx, by, bw, bh, conf, 0, 1].

Therefore, how should the ground truth for the person's bounding box be encoded? Should it be an offset from a particular anchor of a grid cell? This is what is still unclear to me.
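For what it's worth, here is a hedged sketch of one common way to build such a target by inverting the formulas above; the anchor list, the helper name, and the choice to store the in-cell offset σ(t) directly (applying the sigmoid to the network output instead) are all my own assumptions:

import numpy as np

def encode_box(cx, cy, w, h, anchors, S=13, num_classes=2, cls=1):
    # cx, cy, w, h: normalized ground-truth box (all in [0, 1]);
    # anchors: list of (aw, ah) pairs, also normalized to the image.
    target = np.zeros((S, S, len(anchors) * (5 + num_classes)), dtype=np.float32)

    i, j = int(cx * S), int(cy * S)   # grid cell containing the box center
    tx, ty = cx * S - i, cy * S - j   # offsets inside the cell, i.e. sigma(t)

    # match the box to the anchor with the highest width/height IoU
    ious = []
    for aw, ah in anchors:
        inter = min(w, aw) * min(h, ah)
        ious.append(inter / (w * h + aw * ah - inter))
    best = int(np.argmax(ious))
    aw, ah = anchors[best]

    tw, th = np.log(w / aw), np.log(h / ah)   # inverse of b_w = p_w * e^(t_w)

    off = best * (5 + num_classes)
    target[j, i, off:off + 5] = [tx, ty, tw, th, 1.0]   # box + objectness
    target[j, i, off + 5 + cls] = 1.0                   # one-hot class (person)
    return target

YOLOv3 predicts at three scales (13 x 13, 26 x 26 and 52 x 52 for a 416 x 416 input), each with its own set of anchors, so the same encoding is repeated once per scale.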
Thank you!
Therefore, how should the ground truth for the person's bounding box be encoded?
The bounding boxes are defined by the normalized coordinates of the bounding-box center, plus its normalized width and height. Each box should be represented as a single line in a text file:
<object-class> <x> <y> <bb_width> <bb_height>
Given 3 classes:

0 full-image
1 top-left-quarter
2 left-half

Three bounding boxes that fill, respectively, the full image, the top-left quarter, and the left half would be represented like this:
0 0.500000 0.500000 1.000000 1.000000
1 0.250000 0.250000 0.500000 0.500000
2 0.250000 0.500000 0.500000 1.000000
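Tying this back to the question's raw [x1, y1, x2, y2] labels, a small sketch of producing such a line (names are mine; Darknet expects one such .txt label file per image):

def to_darknet_line(cls_id, x1, y1, x2, y2, img_w, img_h):
    # Format one pixel-coordinate box as
    # <object-class> <x> <y> <bb_width> <bb_height>, all normalized.
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# e.g. the question's box on a 416 x 416 image, class person (id 1):
print(to_darknet_line(1, 100, 100, 200, 200, 416, 416))
# 1 0.360577 0.360577 0.240385 0.240385

Note that when training with the Darknet framework itself, the anchor matching and offset encoding asked about above are handled internally during training; you only supply labels in this plain format.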
Answered by smedegaard on July 13, 2021