2022
DOI: 10.1109/tip.2022.3195321
End-to-End Temporal Action Detection With Transformer

Cited by 109 publications (26 citation statements)
References 64 publications
“…As shown in Tab. 2, when using one modality as input, our model variants that apply only self-attention in the encoder outperform all compared TAL methods, where TadTR [25] and ActionFormer [48] also use an end-to-end transformer-based architecture. When using both audio and visual modalities, the performance of our model improves significantly, e.g., by +11.9% and +10.7% average mAP compared with our visual-only and audio-only variants, respectively.…”
Section: Results and Analysis (mentioning)
confidence: 82%
“…By contrast, single-stage TAL localizes actions in a single shot without pre-generated proposals, covering anchor-based [26] and anchor-free methods [19,47]. In addition, Transformers [41], with their strong ability to model long-range relations, have recently been adopted in several single-stage TAL methods [25,36,48]. Sound event detection (SED) focuses on recognizing and locating audio events in purely acoustic environments [27].…”
Section: Uni-modal Temporal Localization Tasks (mentioning)
confidence: 99%
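The anchor-free, single-shot formulation described in the statement above can be made concrete with a small sketch: each position of the temporal feature sequence directly predicts class scores and distances to the segment start and end, so no pre-generated proposals are needed. The module and parameter names below are hypothetical and do not come from any of the cited methods; this is a minimal illustration, not a faithful reimplementation.

```python
# Hypothetical anchor-free single-stage TAL head (illustration only).
# Each temporal position predicts class scores plus distances to the
# action's start and end, which are decoded into segments directly,
# without pre-generated proposals.
import torch
import torch.nn as nn

class AnchorFreeTALHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Conv1d(feat_dim, num_classes, kernel_size=3, padding=1)
        # Two regression channels: distance to start, distance to end.
        self.reg_head = nn.Conv1d(feat_dim, 2, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, feat_dim, T) temporal features from a video backbone
        cls_logits = self.cls_head(feats)          # (batch, num_classes, T)
        offsets = self.reg_head(feats).relu()      # (batch, 2, T), non-negative
        t = torch.arange(feats.shape[-1], device=feats.device, dtype=feats.dtype)
        starts = t - offsets[:, 0]                 # decoded segment start per position
        ends = t + offsets[:, 1]                   # decoded segment end per position
        return cls_logits, torch.stack([starts, ends], dim=1)  # (batch, 2, T)
```

At inference, one would typically keep positions with high class scores and apply non-maximum suppression to the decoded segments; that post-processing is omitted here for brevity.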
“…Vid2Seq achieves state-of-the-art results on various dense event captioning datasets, as well as on multiple video paragraph captioning and standard video clip captioning benchmarks. Finally, we believe the sequence-to-sequence design of Vid2Seq has the potential to be extended to a wide range of other video tasks such as temporally-grounded video question answering [51,56,57] or temporal action localization [16,67,123].…”
Section: Discussion (mentioning)
confidence: 99%
“…However, the predicted proposals rely heavily on local information and do not make full use of contextual relations. To model long-range context, some recent works, such as RTD-Net [5] and TadTR [28], treat the video as a temporal sequence and introduce a self-attention transformer structure. Because applying attention over the whole sequence is inefficient and introduces irrelevant noise, ActionFormer [4] proposed a local attention mechanism that limits the attention range to a fixed window.…”
Section: Related Work (mentioning)
confidence: 99%
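The fixed-window attention mentioned in the statement above can also be illustrated with a short sketch: scaled dot-product attention is computed as usual, but positions farther than a fixed window from the query time step are masked out. This assumes a plain attention formulation and is only an illustration of a limited attention range; it is not the actual ActionFormer implementation, which avoids forming the full score matrix for efficiency.

```python
# Local (windowed) temporal self-attention sketch: each time step attends
# only to neighbours within a fixed window instead of the whole sequence.
# Illustration only; efficient implementations restrict the computation
# itself to the window rather than masking a full (T, T) score matrix.
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, window: int):
    # q, k, v: (batch, T, dim) temporal feature sequences
    T, dim = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(1, 2) / dim ** 0.5              # (batch, T, T)
    idx = torch.arange(T, device=q.device)
    outside = (idx[None, :] - idx[:, None]).abs() > window   # True where |i - j| > window
    scores = scores.masked_fill(outside, float("-inf"))      # block attention outside the window
    return F.softmax(scores, dim=-1) @ v                     # (batch, T, dim)

# Example: x = torch.randn(2, 128, 256); y = local_self_attention(x, x, x, window=9)
```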