End-to-End Object Detection with Transformers


End-to-End Object Detection with Transformers proposes a new model called DEtection TRansformer (DETR), an end-to-end model for object detection that performs competitively against SOTA models on benchmarks. The cool thing about this paper is how simple the architecture of the model is. To get the bounding boxes, previous models tend to split the problem into multiple stages: 1) generating a large number of proposals, 2) classifying the proposals, and 3) post-processing to clean up overlapping proposals. DETR aims to simplify that with a single end-to-end model. A few interesting ideas caught my eye, so I shall share them here.

Set prediction

The model outputs a fixed-size set of N predictions, regardless of the number of objects in the image. Each prediction is a tuple of (class, bounding box). The class can be one of the classes to be detected or the null object $\varnothing$ ("no object"). The bounding box is made up of the usual attributes: x, y, width, height. During training, the ground truth is the set of objects in the image, padded with $\varnothing$ up to N elements.
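To make the padding concrete, here is a minimal sketch in plain Python. The names `pad_targets` and `NO_OBJECT` are my own, and using `None` for the null object is an assumption for illustration, not how the authors' code represents it.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]        # (x, y, width, height), normalized to [0, 1]
Target = Tuple[Optional[str], Optional[Box]]   # class None stands in for the null object

NO_OBJECT: Target = (None, None)

def pad_targets(objects: List[Target], n_slots: int) -> List[Target]:
    """Pad the ground-truth list with null objects so it has exactly n_slots entries."""
    if len(objects) > n_slots:
        raise ValueError("more ground-truth objects than prediction slots")
    return objects + [NO_OBJECT] * (n_slots - len(objects))

# Two annotated objects, five prediction slots (toy numbers).
ground_truth = [
    ("cat", (0.30, 0.40, 0.20, 0.25)),
    ("dog", (0.70, 0.55, 0.30, 0.40)),
]
padded = pad_targets(ground_truth, n_slots=5)
print(len(padded))  # 5
```

With the ground truth padded to the same size as the prediction set, the two sets can be compared element by element once an ordering has been chosen.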

Bipartite matching

To account for the predictions coming out in an arbitrary order relative to the ground truth, a bipartite matching loss is used. Given a pairwise matching loss $\mathcal{L}_{match}(\hat{y}, y)$, it finds the permutation of the predictions that gives the minimum total loss.

Bipartite Graph

This matching step plays the same role as the heuristic rules used to match proposals or anchors to ground truth objects in past object detection models. The optimal assignment is found using the Hungarian algorithm. As can be seen from the image above, the name of the loss function comes from the bipartite graph in graph theory.
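The matching problem itself is easy to sketch. The snippet below is not the Hungarian algorithm: it brute-forces every permutation, which is only feasible for a handful of elements, but it returns the same minimum-cost assignment. In practice an assignment solver such as SciPy's `linear_sum_assignment` would be used instead. The function name `best_matching` and the toy 1-D "boxes" are assumptions for illustration.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

def best_matching(
    preds: Sequence[float],
    targets: Sequence[float],
    match_cost: Callable[[float, float], float],
) -> Tuple[Tuple[int, ...], float]:
    """Return (perm, total) where perm[i] is the index of the prediction
    assigned to targets[i] and total is the minimum summed matching cost."""
    n = len(targets)
    if len(preds) != n:
        raise ValueError("predictions and targets must have the same size")
    best_perm, best_total = tuple(range(n)), float("inf")
    for perm in permutations(range(n)):
        total = sum(match_cost(preds[perm[i]], targets[i]) for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# Toy example: each "box" is a single number and the cost is the absolute distance.
preds = [0.9, 0.1, 0.5]
targets = [0.0, 0.5, 1.0]
perm, total = best_matching(preds, targets, lambda p, t: abs(p - t))
print(perm)  # (1, 2, 0)
```

Here target 0 is matched with prediction 1, target 1 with prediction 2, and target 2 with prediction 0, so the order in which the model emits its predictions does not affect the loss.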

Loss function

As for the actual loss function, it's a straightforward negative log-likelihood for the class combined with a bounding box loss.
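Written out, the loss looks roughly like this. This is my transcription of the Hungarian loss from the paper, so treat the notation as an assumption: $c_i$ and $b_i$ are the ground-truth class and box of element $i$, $\hat{p}$ is the predicted class probability, $\hat{b}$ is the predicted box, and $\hat{\sigma}$ is the optimal assignment found in the matching step.

```latex
\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \, \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]
```

The indicator means the box term only applies to real objects; slots matched to $\varnothing$ contribute only through the class term.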

Some interesting notes:

  • the weight of the log-probability term is reduced by a factor of 10 when the class is $\varnothing$, in order to account for the class imbalance
  • the bounding box loss is a linear combination of an $\ell_1$ loss and an IoU (intersection over union) loss (the paper uses the generalized IoU variant). The IoU term helps mitigate an issue with the $\ell_1$ loss, where the magnitude of the loss is higher for larger boxes than for smaller ones even if their relative errors are the same.
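The scale issue in the second note is easy to check numerically. The sketch below assumes boxes are (center x, center y, width, height) and uses plain IoU rather than the generalized IoU from the paper; the weights in `box_loss` are placeholders of my own, not the paper's values.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height), an assumed format

def l1_loss(a: Box, b: Box) -> float:
    """Sum of absolute differences over the four box coordinates."""
    return sum(abs(p - q) for p, q in zip(a, b))

def iou(a: Box, b: Box) -> float:
    """Plain intersection over union of two axis-aligned boxes."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def box_loss(pred: Box, target: Box, w_iou: float = 1.0, w_l1: float = 1.0) -> float:
    """Linear combination of an IoU loss and an l1 loss (placeholder weights)."""
    return w_iou * (1.0 - iou(pred, target)) + w_l1 * l1_loss(pred, target)

# Same relative error (shifted by 20% of the width), two different box sizes.
small_target, small_pred = (0.5, 0.5, 0.1, 0.1), (0.52, 0.5, 0.1, 0.1)
large_target, large_pred = (0.5, 0.5, 0.4, 0.4), (0.58, 0.5, 0.4, 0.4)

l1_small, l1_large = l1_loss(small_pred, small_target), l1_loss(large_pred, large_target)
iou_small, iou_large = iou(small_pred, small_target), iou(large_pred, large_target)
print(l1_small, l1_large)    # the l1 loss is about 4x larger for the large box
print(iou_small, iou_large)  # the IoU is about the same for both
```

The large box is penalized four times as much by the $\ell_1$ term for the same relative mistake, while the IoU term treats both cases equally, which is why combining the two gives a more balanced signal.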



The architecture itself is relatively straightforward: a CNN backbone feeding into a Transformer. The interesting part is the input embeddings to the decoder.

Object Queries

These are N learnt positional embeddings passed in as inputs to the decoder. Each of the input embeddings corresponds to one of the final predictions. The decoder transforms these embeddings to give the final prediction. Because all of the inputs are being processed at the same time, the model can globally reason about the final set of objects.
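A toy version of this idea can be sketched in plain Python. This is a heavily simplified assumption of mine, not the actual decoder: it uses a single attention head with no learned projections, random vectors in place of learnt embeddings, and it leaves out the self-attention between queries (which is where the slots would exchange information with each other). The point is only that N query vectors go in and N output slots come out, however many image features there are.

```python
import math
import random
from typing import List

Vector = List[float]

def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Each query attends over all image features and returns a weighted average."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        outputs.append([sum(w * f[k] for w, f in zip(weights, features)) for k in range(d)])
    return outputs

random.seed(0)
N, D = 4, 8
# Random stand-ins: in the real model the queries are learnt and the features come from the encoder.
object_queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
image_features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(30)]

slots = cross_attend(object_queries, image_features)
print(len(slots), len(slots[0]))  # 4 8
```

In the real model each output slot would then go through a small prediction head to produce its (class, bounding box) tuple.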

This makes even more sense when we take a look at the visualization of the box predictions of each of the N inputs. The scatter plot below shows the positions for 4 of the predictions over the COCO dataset. The points represent the normalized center of the predicted bounding box and the color represents the size of the bounding box. Green corresponds to small bounding boxes, red to large horizontal bounding boxes and blue to large vertical bounding boxes.


It is clear that each of the inputs specializes in a different segment of the images. The top-left plot specializes in smaller objects at the bottom left of the image, while the bottom-right plot specializes in bigger objects in the center of the image.

My thoughts: It is likely that this combination of specialization and global reasoning works well in finding features useful for object detection. As features are found, they are communicated to the rest of the model, which can then use the attention layers to respond accordingly.


It is a paper with some interesting solutions that make an end-to-end system workable. It is also good to see transformers being competitive in image tasks, and it would be interesting to see if they can be used in more tasks. For more reading, visit their blog or their code, which is available on their GitHub. As always, all feedback and discussion are welcome @vivekkalyansk.


Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-End Object Detection with Transformers.” ArXiv:2005.12872 [Cs], May 28, 2020. http://arxiv.org/abs/2005.12872.