End-to-End Object Detection with Transformers proposes a new model called DEtection TRansformer (DETR), an end-to-end model for object detection that performs competitively against SOTA models on standard benchmarks. The cool thing about this paper is how simple the architecture of the model is. To get bounding boxes, previous models tend to split the problem into multiple stages: 1) generating a large number of proposals, 2) classifying the proposals, and 3) post-processing to clean up overlapping proposals. DETR simplifies this by proposing a single end-to-end model. There are a few interesting ideas that caught my eye, so I shall share those here.
The model predicts a set of N predictions, regardless of the number of objects in the image. Each prediction is a tuple of (class, bounding box). The class can be one of the classes to be detected or the null object ∅. The bounding box is made up of the usual attributes: x, y, width, height. During training, the ground truth is made up of the objects in the image, padded with ∅ up to N entries.
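As a concrete illustration, here is a minimal sketch (in PyTorch, which the official code also uses) of padding the ground truth to a fixed-size set; the helper name, the value of N, and the choice of index 0 for ∅ are mine, not the paper's.

```python
import torch

NUM_QUERIES = 100  # N: fixed number of predictions per image (illustrative value)
NO_OBJECT = 0      # index reserved for the null object ∅ in this sketch

def pad_targets(classes, boxes, num_queries=NUM_QUERIES):
    """Pad ground truth to a fixed-size set: real objects first, ∅ after."""
    num_objects = classes.shape[0]
    padded_classes = torch.full((num_queries,), NO_OBJECT, dtype=torch.long)
    padded_classes[:num_objects] = classes
    padded_boxes = torch.zeros(num_queries, 4)  # (x, y, width, height), normalized
    padded_boxes[:num_objects] = boxes
    return padded_classes, padded_boxes
```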
To account for ordering differences between the predictions and the ground truth, a bipartite matching loss is used. Given a pairwise matching cost $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ between ground truth object $y_i$ and the prediction at position $\sigma(i)$, it finds the permutation $\hat{\sigma}$ of the predictions that gives the minimum total cost:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$
This matching step plays the same role as the heuristic rules used to match proposals or anchors to ground truth objects in past object detection models. The minimization above is solved using the Hungarian algorithm. As the image above illustrates, the name of the loss comes from the bipartite graphs of graph theory: the predictions and the ground truth objects form the two sides of the graph, and the matching pairs them up.
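Here is a minimal sketch of the matching step, using the Hungarian solver from SciPy (scipy.optimize.linear_sum_assignment). The cost below is a simplified $\mathcal{L}_{match}$ with only a class term and an L1 box term; the paper's version also includes a generalized IoU term, and the box_weight parameter is my own placeholder.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_classes, gt_boxes, box_weight=1.0):
    """Match N predictions to M ground truth objects with minimum total cost.

    pred_logits: (N, num_classes), pred_boxes: (N, 4)
    gt_classes: (M,) class indices, gt_boxes: (M, 4)
    """
    probs = pred_logits.softmax(-1)                     # (N, num_classes)
    class_cost = -probs[:, gt_classes]                  # (N, M): high prob => low cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M) pairwise L1 distance
    cost = class_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```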
As for the actual loss function, it's a straightforward negative log-likelihood for the class combined with a bounding box loss.
Some interesting notes:

- The log-likelihood term for the ∅ class is down-weighted by a factor of 10 to account for the class imbalance between real objects and padding.
- The bounding box loss is a linear combination of an L1 loss and a generalized IoU loss, since an L1 loss alone would penalize errors on large boxes more heavily than on small ones.
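Putting these notes together, a sketch of the final training loss might look like this; the 0.1 weight on ∅ mirrors the factor-of-10 down-weighting above, while the plain L1 box term stands in for the paper's L1 + generalized IoU combination.

```python
import torch
import torch.nn.functional as F

def detr_loss(pred_logits, pred_boxes, gt_classes, gt_boxes, pred_idx, gt_idx,
              no_object=0, no_object_weight=0.1):
    """Negative log-likelihood over all N slots plus a box loss on matched pairs.

    pred_idx / gt_idx come from the bipartite matching step above.
    """
    # Every slot gets a class target: matched slots take their object's class,
    # unmatched slots are labelled ∅.
    target_classes = torch.full((pred_logits.shape[0],), no_object, dtype=torch.long)
    target_classes[pred_idx] = gt_classes[gt_idx]

    weights = torch.ones(pred_logits.shape[-1])
    weights[no_object] = no_object_weight  # down-weight ∅ against class imbalance
    class_loss = F.cross_entropy(pred_logits, target_classes, weight=weights)

    # The box loss only applies to slots matched to a real object.
    box_loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return class_loss + box_loss
```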
The architecture itself is relatively straightforward: a transformer encoder-decoder stacked on top of a CNN backbone. The interesting portion is the input embeddings to the decoder.
These are N learnt positional embeddings (the paper calls them object queries) passed in as inputs to the decoder. Each of the input embeddings corresponds to one of the final predictions, and the decoder transforms these embeddings into the final predictions. Because all of the inputs are processed at the same time, the model can reason globally about the final set of objects.
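A skeletal sketch of this design follows; the single strided conv standing in for the CNN backbone, the layer sizes, and the omission of positional encodings on the image features are all simplifications of mine.

```python
import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """CNN features -> transformer encoder-decoder -> N (class, box) predictions."""

    def __init__(self, num_classes, num_queries=100, d_model=256):
        super().__init__()
        # Stand-in for a real CNN backbone (e.g. a ResNet); positional
        # encodings on the image features are omitted for brevity.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        # The N learnt embeddings fed to the decoder; each one is
        # responsible for one of the N output slots.
        self.queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for ∅
        self.box_head = nn.Linear(d_model, 4)                  # (x, y, w, h)

    def forward(self, images):                                     # (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)   # (B, H*W/256, d_model)
        queries = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(feats, queries)  # all N slots are decoded jointly
        return self.class_head(hs), self.box_head(hs).sigmoid()
```

Because the decoder's self-attention runs across all N query slots at once, each slot can see what the others are predicting, which is what enables the global reasoning described above.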
This makes even more sense when we take a look at the visualization of the box predictions of each of the N inputs. The scatter plot below shows the positions for 4 of the predictions over the COCO dataset. The points represent the normalized center of the predicted bounding box and the color represents the size of the bounding box. Green corresponds to small bounding boxes, red to large horizontal bounding boxes and blue to large vertical bounding boxes.
It is clear that each of the inputs specializes in a different segment of the image. The top-left slot specializes in smaller objects in the bottom left of the image, while the bottom-right slot focuses on bigger objects in the center of the image.
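For the curious, here is roughly how such a plot could be reproduced from a trained model's predictions; the color mapping below is my own approximation of the paper's scheme, not its actual plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_slot_specialization(centers, sizes, slot_id):
    """Scatter the normalized box centers predicted by one slot over a dataset.

    centers: (K, 2) normalized (cx, cy), sizes: (K, 2) normalized (w, h).
    """
    w, h = sizes[:, 0], sizes[:, 1]
    # Approximate the paper's convention: small boxes -> green,
    # large horizontal boxes -> red, large vertical boxes -> blue.
    colors = np.stack([w, 1.0 - np.maximum(w, h), h], axis=1).clip(0.0, 1.0)
    plt.scatter(centers[:, 0], centers[:, 1], c=colors, s=4)
    plt.gca().invert_yaxis()  # image coordinates put the origin at the top left
    plt.title(f"prediction slot {slot_id}")
    plt.xlabel("normalized x")
    plt.ylabel("normalized y")
    plt.show()
```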
My thoughts: it is likely that this combination of specialization and global reasoning works well together in finding features useful for object detection. As features are found, they are communicated across the entire model, which can then use the attention layers to respond accordingly.
It is a paper with some interesting solutions for making an end-to-end system workable. It is also good to see transformers being competitive in image tasks, and it would be interesting to see whether they can be used in more tasks. For more reading, visit their blog or their code, which is available on their GitHub. As always, all feedback and discussion welcome @vivekkalyansk.
References:
Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-End Object Detection with Transformers.” ArXiv:2005.12872 [Cs], May 28, 2020. http://arxiv.org/abs/2005.12872.