Exploiting Informative Video Segments for Temporal Action Localization

Sun, Che; Song, Hao; Wu, Xinxiao; Jia, Yingmin; Luo, Jiebo

doi:10.1109/tmm.2021.3050067

Cited by 22 publications

(6 citation statements)

References 49 publications


“…• Accurate boundary • Lacking temporal modeling [51], [41], [100], [50], [43], [101], [102], frames [88], [46], [103], [92], [18], [104], [103] • Tackling long instances • Separated procedures Classifying [1], [105], [13], [106], [98], [42],…”

Section: Classification Mechanism (mentioning)

confidence: 99%

“…• Temporal modeling • Insufficient detail Field [111], [90], [46], [104], [112], [88], [113], [18], global relationship [102], [94], [95], [47], [18], [49], [91], [96] • Intra-video diversity representation Inter-video [114], [115] • Representative • Complicated relationship category features training End-to-End [1], [34], [105], [36], [114], [116],…”

Section: Classification Mechanism (mentioning)

confidence: 99%


Temporal Action Localization in the Deep Learning Era: A Survey

Wang, Zhao, Yang et al. 2024

IEEE Trans. Pattern Anal. Mach. Intell.


The temporal action localization research aims to discover action instances from untrimmed videos, representing a fundamental step in the field of intelligent video understanding. With the advent of deep learning, backbone networks have been instrumental in providing representative spatiotemporal features, while the end-to-end learning paradigm has enabled the development of high-quality models through data-driven training. Both supervised and weakly supervised learning approaches have contributed to the rapid progress of temporal action localization, resulting in a multitude of methods and a large body of literature, making a comprehensive survey a pressing necessity. This paper presents a thorough analysis of existing action localization works, offering a well-organized taxonomy that highlights the strengths and weaknesses of each strategy. In the realm of supervised learning, in addition to the anchor mechanism, we introduce a novel classification mechanism to categorize and summarize existing works. Similarly, for weakly supervised learning, we extend the traditional pre-classification and post-classification mechanisms by providing a fresh perspective on enhancement strategies. Furthermore, we shed light on the bottleneck of confidence estimation, a critical yet overlooked aspect of current works. By conducting detailed analyses, this survey serves as a valuable resource for researchers, providing beneficial guidance to newcomers and inspiring seasoned researchers alike.


“…$\frac{\sum_{j \in F_i,\, i \neq j} e^{\cos(f_i, f_j)/\tau}}{\sum_{j \in F,\, i \neq j} e^{\cos(f_i, f_j)/\tau}}$ (10) where cos denotes the cosine similarity function, $F$ denotes the number of sampled foregrounds and background, $f_i$ denotes the ith of $F$, $F_i$ denotes the number of sampled foregrounds/backgrounds, which is similar to $f_i$, and $\tau$ denotes the temperature parameter. This loss function will minimise the feature gap of the same categories (foreground/background), maximize the feature gap between foreground and background, and force the attention map to distinguish the foreground/background with the largest difference in the original input features.…”

Section: E Imagewise Contrastive Module (mentioning)

confidence: 99%
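
The loss described in the excerpt above can be illustrated with a minimal sketch. This is not the cited authors' implementation: the excerpt elides the left-hand side of Eq. (10), so the negative log and the averaging over anchors are assumptions borrowed from the usual contrastive-loss form, and the names `cosine`, `contrastive_loss`, `features`, `labels`, and `tau` are hypothetical. Features are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(features, labels, tau=0.1):
    """Foreground/background contrastive loss (illustrative sketch).

    For each anchor i, the numerator sums exp(cos(f_i, f_j) / tau) over
    samples j != i with the same label (the set F_i in the excerpt); the
    denominator sums the same term over all samples j != i (the set F).
    Assumption: the per-anchor loss is -log(numerator / denominator),
    averaged over anchors that have at least one same-label partner.
    """
    total, counted = 0.0, 0
    for i, (f_i, y_i) in enumerate(zip(features, labels)):
        numerator = 0.0
        denominator = 0.0
        for j, (f_j, y_j) in enumerate(zip(features, labels)):
            if j == i:
                continue
            weight = math.exp(cosine(f_i, f_j) / tau)
            denominator += weight
            if y_j == y_i:
                numerator += weight
        if numerator > 0.0:
            total += -math.log(numerator / denominator)
            counted += 1
    return total / counted if counted else 0.0


# Toy data: two foreground-like and two background-like feature vectors.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_separated = contrastive_loss(feats, [1, 1, 0, 0])  # labels match clusters
loss_mixed = contrastive_loss(feats, [1, 0, 1, 0])      # labels cut across clusters
```

With labels that match the feature clusters the loss is close to zero, while labels that cut across the clusters give a much larger value, which is the behaviour the excerpt describes: same-category features are pulled together and foreground is pushed away from background.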

“…Recently, Actionformer [6] based on Transformer achieved the best TAL performance in [5], [6]. However, most TAL methods refine discriminative action boundaries from segment-level semantics [7]-[10],…”

Section: Introduction (mentioning)

confidence: 99%

Locating X-ray coronary angiogram keyframes via long short-term spatiotemporal attention with image-to-patch contrastive learning

Zhang, Qin, Ding et al. 2023

Preprint


Locating the start, apex and end keyframes of moving contrast agents for keyframe counting during X-ray coronary angiography (XCA) is very important in the diagnosis and treatment of cardiovascular diseases. To locate these keyframes from the class-imbalanced and boundary-agnostic foreground vessel actions that overlap complex backgrounds, we propose long short-term spatiotemporal attention by integrating a convolutional long short-term memory (CLSTM) network into a multiscale Transformer to learn the segment- and sequence-level dependences in the consecutive-frame-based deep features. Image-to-patch contrastive learning is further embedded between the CLSTM-based long-term spatiotemporal attention and Transformer-based short-term attention modules. The imagewise contrastive module reuses the long-term attention to contrast image-level foreground/background of XCA sequence, while patchwise contrastive projection selects the random patches of backgrounds as convolution kernels to project foreground/background frames into different latent spaces. A new XCA video dataset is collected to evaluate the proposed neural network. The experimental results show that the proposed method achieves a mAP of 70.51% and an F1-score of 0.8188, considerably outperforming the state-of-the-art methods. The source code and dataset are available at https://github.com/Binjie-Qin/STA-IPCon.


“…As a single-modal task, temporal action localization aims to classify action instances by predicting the corresponding start timestamps, end timestamps, and action category labels [8], [56], [57]. Existing methods can be divided into one-stage methods [58]-[60] and two-stage methods [61]-[64].…”

Section: B Temporal Action Localization (mentioning)

confidence: 99%