Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization

Zhang, Chengwei; Xu, Yunlu; Cheng, Zhanzhan; Niu, Yi; Pu, Shiliang; Wu, Fei; Zou, Futai

doi:10.1145/3343031.3351044

Cited by 50 publications

(13 citation statements)

References 28 publications

“…With the rapid development of artificial intelligence techniques [18], [19], [20], [21], great progress has been made in many isolated applications such as causal inference [22], named entities identification [23], question answering [24], scene text spotting [5], [6], [17] and video understanding [25], [26]. However, it is very important to build multiple knowledge representation [27] for understanding the real and complex world.…”

Section: Related Work (mentioning)

confidence: 99%

FREE: A Fast and Robust End-to-End Video Text Spotter

Cheng

Lü

Zou

et al. 2021

IEEE Trans. on Image Process.

Self Cite

Currently, video text spotting tasks usually fall into the four-staged pipeline: detecting text regions in individual images, recognizing localized text regions frame-wisely, tracking text streams and post-processing to generate final results. However, they may suffer from the huge computational cost as well as suboptimal results due to the interferences of low-quality text and the non-trainable pipeline strategy. In this paper, we propose a fast and robust end-to-end video text spotting framework named FREE by only recognizing the localized text stream one time instead of frame-wise recognition. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations among video frames. Then a novel text recommender is developed to select the highest-quality text from text streams for recognizing. Here, the recommender is implemented by assembling text tracking, quality scoring and recognition into a trainable module. It not only avoids the interferences from the low-quality text but also dramatically speeds up the video text spotting. FREE unites the detector and recommender into a whole framework, and helps achieve global optimization. Besides, we collect a large scale video text dataset for promoting the video text spotting community, containing 100 videos from 21 real-life scenarios. Extensive experiments on public benchmarks show our method greatly speeds up the text spotting process, and also achieves the remarkable state-of-the-art.

“…To handle the two problems, existing methods can be divided into three types. The first type of works attempt to solve the localization completeness by applying a well-designed erasing strategy [37,55,53] or a multi-branch architecture [21]. For example, Zhong et al. [55] design a step-by-step erasion approach to train the one-by-one classifiers, via collecting detection results from these classifiers, more action segments are found.…”

Section: Related Work (mentioning)

confidence: 99%

“…To relieve this problem, the weakly supervised setting that only requires video-level category labels is proposed [37,55,39,37,55,53,29,30,45,46]. It can be formulated as a multiple instance learning problem, where a video is treated as a bag of multiple segments and fed into a video-level classifier to get a class activation sequence (CAS).…”

Section: Introduction (mentioning)

confidence: 99%

“…There are two primary challenges, named localization completeness and background interference. To solve the first challenge, previous works usually adopt a well-designed erasing strategy [37,55,53] or a multi-branch architecture [21]. Both of them aim to force the model to concentrate on different parts of videos and hence discover the whole action without missing any relevant segments.…”

Section: Introduction (mentioning)

confidence: 99%

Action Unit Memory Network for Weakly Supervised Temporal Action Localization

Luo

Zhang

Yang

et al. 2021

Preprint

Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training. However, without frame-level annotations, it is challenging to achieve localization completeness and relieve background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which can mitigate the above two challenges by learning an action unit memory bank. In the proposed AUMN, two attention modules are designed to update the memory bank adaptively and learn action units specific classifiers. Furthermore, three effective mechanisms (diversity, homogeneity and sparsity) are designed to guide the updating of the memory network. To the best of our knowledge, this is the first work to explicitly model the action units with a memory network. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that our AUMN performs favorably against state-of-the-art methods. Specifically, the average mAP of IoU thresholds from 0.1 to 0.5 on the THUMOS14 dataset is significantly improved from 47.0% to 52.1%.

“…The past decade has witnessed the great efforts in action understanding [1,2,3,4,5], among which action detection is receiving the most considerable attention [6,7,8,9,10,11,12]. Action detection targets predicting if an action occurs in a video that has its complete observation; meanwhile, finding the relevant spatial-temporal location.…”

Section: Introduction (mentioning)

confidence: 99%

Human-Aware Coarse-to-Fine Online Action Detection

Yang

Huang

Qin

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, we propose a two-stage framework to efficiently and effectively detect actions on-the-fly. An action location network (ALN) is developed in the first stage to judge whether the current frame is action-related, while the second stage involves an action classification network (ACN) to further identify the action category. In this way, irrelevant negative frames are quickly discarded and actions are detected as early as they occur. Moreover, we highlight human areas at both the stages by respectively incorporating a human detector and a human mask layer. As a result, more accurate spatial-temporal windows of actions are detected, based on which more robust features are extracted for classification. Experimental results on two popular benchmarks demonstrate the superior performance of the proposed approach.
