End-to-End Object Detection with Transformers

Carion, Nicolas; Massa, Francisco; Synnaeve, Gabriel; Usunier, Nicolas; Kirillov, Alexander M.; Zagoruyko, Sergey

doi:10.48550/arxiv.2005.12872

Cited by 223 publications

(457 citation statements)

References 0 publications

Supporting

Mentioning

454

Contrasting

Unclassified

Order By: Relevance

“…Since their introduction by Vaswani et al (2017), transformers, originally designed for machine translation, were applied to various problems, from text generation (Radford et al, 2018; to image processing (Carion et al, 2020) and speech recognition (Dong et al, 2018) where they soon achieved state-of-the-art performance (Dosovitskiy et al, 2021;Wang et al, 2020b). In mathematics, transformers were used for symbolic integration (Lample & Charton, 2019), theorem proving (Polu & Sutskever, 2020), formal logic (Hahn et al, 2021), SAT solving (Shi et al, 2021), symbolic regression (Biggio et al, 2021) and dynamical systems (Charton et al, 2020).…”

Section: Introductionmentioning

confidence: 99%

Linear algebra with transformers

Charton¹

2021

Preprint

View full text Add to dashboard Cite

Most applications of transformers to mathematics, from integration to theorem proving, focus on symbolic computation. In this paper, we show that transformers can be trained to perform numerical calculations with high accuracy. We consider problems of linear algebra: matrix transposition, addition, multiplication, eigenvalues and vectors, singular value decomposition, and inversion. Training small transformers (up to six layers) over datasets of random matrices, we achieve high accuracies (over 90%) on all problems. We also show that trained models can generalize out of their training distribution, and that out-of-domain accuracy can be greatly improved by working from more diverse datasets (in particular, by training from matrices with non-independent and identically distributed coefficients). Finally, we show that few-shot learning can be leveraged to re-train models to solve larger problems.

show abstract

Section: Introductionmentioning

confidence: 99%

Linear algebra with transformers

Charton¹

2021

Preprint

View full text Add to dashboard Cite

show abstract

“…The experiments are conducted on two backbones, ResNet-50 and ResNet-101. We use a COCO pre-trained DETR (Carion et al 2020) to initialize the weights. The model is trained with AdamW, and the learning rate is set to 1e-4 except that the learning rate for backbone is set to 1e-5.…”

Section: Experimental Settingsmentioning

confidence: 99%

Improving Human-Object Interaction Detection via Phrase Learning and Label Composition

Li¹,

Zhang²,

Yu³

et al. 2021

Preprint

View full text Add to dashboard Cite

Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, containing a HOI branch and a novel phrase branch, to leverage language prior and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings, whose ground truths are automatically converted from the original HOI annotations without extra human efforts. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI, which composites novel phrase labels by semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments are conducted to prove the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods on Full and NonRare on the challenging HICO-DET benchmark.* Equal contribution. This work was done when Zhimin Li was an intern at Megvii, and partially done when Cheng Zou was at Megvii.

show abstract

“…c) This module aims at imposing the two sequences of feature maps to be matched. object detection [9,107]. To better characterize inter-step correlations, we integrate the vision transformer ViT [20] into the backbone by replacing the global average pooling.…”

Section: Vision Transformermentioning

confidence: 99%

SVIP: Sequence VerIfication for Procedures in Videos

Qian¹,

Luo²,

Lian³

et al. 2021

Preprint

View full text Add to dashboard Cite

In this paper, we propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations but still conducting the same task. Such a challenging task resides in an open-set setting without prior action detection or segmentation that requires event-level or even frame-level annotations. To that end, we carefully reorganize two publicly available action-related datasets with step-procedure-task structure. To fully investigate the effectiveness of any method, we collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments. Besides, a novel evaluation metric Weighted Distance Ratio is introduced to ensure equivalence for different step-level transformations during evaluation. In the end, a simple but effective baseline based on the transformer with a novel sequence alignment loss is introduced to better characterize long-term dependency between steps, which outperforms other action recognition methods. Codes and data will be released.

show abstract

End-to-End Object Detection with Transformers

Cited by 223 publications

References 0 publications

Linear algebra with transformers

Linear algebra with transformers

Improving Human-Object Interaction Detection via Phrase Learning and Label Composition

SVIP: Sequence VerIfication for Procedures in Videos

Contact Info

Product

Resources

About