2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01327
Relaxed Transformer Decoders for Direct Action Proposal Generation

Cited by 123 publications (59 citation statements)
References 30 publications
“…A two-stage approach for TAL first generates candidate video segments as action proposals, then classifies the proposals into action categories and refines their temporal boundaries. Several previous works focused on action proposal generation, either by classifying anchor windows [8,9,22] or by detecting action boundaries [26,36,38,47,84], and more recently by using a graph representation [4,76] or Transformers [13,59,67]. Others have integrated proposal generation and classification into a single model [14,55,56,85].…”
Section: Related Work
confidence: 99%
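The two-stage pipeline described in this statement can be summarized in a minimal sketch. The module names (ProposalGenerator, ProposalClassifier), the anchor-based head, and all tensor shapes below are illustrative assumptions, not code from the cited papers.

```python
# Minimal sketch of a two-stage temporal action localization (TAL) pipeline:
# stage 1 generates action proposals, stage 2 classifies them and refines
# their temporal boundaries. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class ProposalGenerator(nn.Module):
    """Stage 1: score candidate segments (hypothetical anchor-window head)."""
    def __init__(self, feat_dim: int, num_anchors: int):
        super().__init__()
        # Per anchor per time step: (actionness score, start offset, end offset).
        self.head = nn.Conv1d(feat_dim, num_anchors * 3, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim, T) clip-level video features.
        B, _, T = feats.shape
        return self.head(feats).view(B, -1, 3, T)  # (B, num_anchors, 3, T)

class ProposalClassifier(nn.Module):
    """Stage 2: classify pooled proposal features and refine boundaries."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.reg = nn.Linear(feat_dim, 2)  # start/end refinement offsets

    def forward(self, proposal_feats: torch.Tensor):
        # proposal_feats: (num_proposals, feat_dim), e.g. from segment pooling.
        return self.cls(proposal_feats), self.reg(proposal_feats)
```

A single-model variant, as in the works cited at the end of the statement, would fuse these two stages so proposals and class scores are produced jointly.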
“…In practice, decoder-only models in vision are mostly used to autoregressively decode captions describing visual input data. Prompting with video frames would mix two very different representations in the decoder, so CNNs substitute for the encoder to provide context [51], [52], [53]. Both encoder-only and decoder-only layers consist of SA and FF sublayers, interleaved with Add+Norm after each of them.…”
Section: Transformer Trends Adopted for Video
confidence: 99%
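As a concrete illustration of the SA/FF sublayer structure this statement describes, below is a minimal post-norm transformer layer. The post-norm ordering and all hyperparameters are assumptions for illustration, not taken from the cited works.

```python
# Minimal sketch of one transformer layer: a self-attention (SA) sublayer and
# a feed-forward (FF) sublayer, each followed by Add+Norm (post-norm ordering).
# Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.sa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) sequence of token embeddings.
        attn_out, _ = self.sa(x, x, x)   # SA sublayer
        x = self.norm1(x + attn_out)     # Add + Norm
        x = self.norm2(x + self.ff(x))   # FF sublayer, then Add + Norm
        return x
```

A decoder-only layer differs only in that the SA sublayer is causally masked (and, in an encoder-decoder, a cross-attention sublayer is interleaved as well).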
“…Instead, relative positional embeddings (RPE) signal the position of one token relative to another, and can also be fixed or learned (see [73] for more details). Lately there has been a growing number of works adopting RPE [12], [52], [60], [74]. We will discuss this in more detail in Sec.…”
Section: Transformer Trends Adopted for Video
confidence: 99%
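To make the RPE idea concrete: one common learned variant adds a per-head bias b[i−j], looked up from a trainable table, to the attention logit between query position i and key position j. The sketch below illustrates this; the table size and indexing scheme are illustrative assumptions, not the scheme of any specific cited work.

```python
# Minimal sketch of a learned relative positional bias: a trainable bias b[i-j]
# is added to the attention logit between query position i and key position j.
# Table size and indexing are illustrative assumptions.
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    def __init__(self, max_len: int, n_heads: int):
        super().__init__()
        # One learnable bias per head for each offset in [-(max_len-1), max_len-1].
        self.bias = nn.Parameter(torch.zeros(2 * max_len - 1, n_heads))
        self.max_len = max_len

    def forward(self, T: int) -> torch.Tensor:
        # Relative offsets i - j for all query/key pairs, shifted to be >= 0.
        pos = torch.arange(T)
        rel = pos[:, None] - pos[None, :] + self.max_len - 1  # (T, T)
        return self.bias[rel].permute(2, 0, 1)  # (n_heads, T, T)
```

The returned (n_heads, T, T) tensor would be added to the attention score matrix before the softmax; because the bias depends only on the offset i−j, not on absolute positions, it encodes relative rather than absolute position.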