2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00109
Weakly-Supervised Action Localization by Generative Attention Modeling

Cited by 148 publications (80 citation statements)
References 42 publications
“…Localization: After obtaining the TCAM $\mathbf{a}$ for a query video and class $c$, we threshold $\mathbf{a}$ and group together consecutive snippets that are above a given threshold. Then, following the standard practice [30,34,38], we arrive at a set of action predictions $(s, e, p)$, where $s$, $e$ and $p$ are the start, end, and prediction score of a certain prediction. We set the prediction score as the average of $\mathbf{a}$ over the individual snippets, that is, $p = \frac{1}{e - s + 1} \sum_{t=s}^{e} \mathbf{a}(t)$.…”
Section: Localization and Classification
Citation type: mentioning
confidence: 99%
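The localization step quoted above reduces to a simple 1-D run-grouping procedure over per-snippet TCAM scores. The sketch below is a minimal, hypothetical illustration of that step, assuming the TCAM is available as a 1-D array of scores; the function name, threshold value, and array layout are assumptions rather than the cited paper's code.

```python
import numpy as np

def tcam_to_proposals(tcam, threshold=0.5):
    """Turn per-snippet TCAM scores into (start, end, score) action proposals.

    Minimal sketch of the quoted localization step: snippets whose TCAM value
    exceeds the threshold are grouped into consecutive runs, and each run is
    scored by the average TCAM value over its snippets.
    """
    tcam = np.asarray(tcam, dtype=float)
    above = tcam > threshold
    proposals = []
    start = None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t                      # open a new run of above-threshold snippets
        elif not flag and start is not None:
            end = t - 1                    # close the run at the previous snippet
            proposals.append((start, end, tcam[start:end + 1].mean()))
            start = None
    if start is not None:                  # run extends to the last snippet
        end = len(tcam) - 1
        proposals.append((start, end, tcam[start:end + 1].mean()))
    return proposals

# Example: two separated high-activation regions yield two proposals.
print(tcam_to_proposals([0.1, 0.8, 0.9, 0.2, 0.7, 0.6], threshold=0.5))
# -> [(1, 2, 0.85), (4, 5, 0.65)]
```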
“…Our TCAMs are similar in functionality to those in other weakly supervised works [31,32,38]; however, a crucial difference is that, in our case, TCAMs are calculated based on similarities with reference videos as in [21], and not from class-based classifiers that can hardly be trained from one/few examples. During training, we optimize a classification loss at the video level in order to ensure the inter-class separability of the learned features.…”
Section: Introduction
Citation type: mentioning
confidence: 97%
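The distinction drawn in this citation, scoring snippets by similarity to a reference video rather than by a learned class-specific classifier, and training with a video-level classification loss, can be sketched as follows. This is a minimal illustration under assumed tensor shapes; all function names, pooling choices, and shapes are hypothetical assumptions and are not taken from the cited work.

```python
import torch
import torch.nn.functional as F

def similarity_tcam(query_feats, reference_feats):
    """Sketch of a similarity-based TCAM (assumed setting, not the paper's code).

    Per-snippet query features are scored by cosine similarity to a prototype
    pooled from a reference video of the target class, instead of by a
    class-specific classifier.

    query_feats:     (T_q, D) snippet features of the query video
    reference_feats: (T_r, D) snippet features of the reference video
    returns:         (T_q,)   one activation score per query snippet
    """
    prototype = reference_feats.mean(dim=0, keepdim=True)        # (1, D) class prototype
    return F.cosine_similarity(query_feats, prototype, dim=-1)   # (T_q,)

def video_level_classification_loss(tcam_per_class, video_label):
    """Video-level classification loss to keep classes separable:
    pool each class's TCAM over time into a single video-level logit and
    apply cross-entropy against the (weak) video-level label."""
    video_logits = tcam_per_class.mean(dim=-1)                   # (C,) pooled over time
    return F.cross_entropy(video_logits.unsqueeze(0),
                           torch.tensor([video_label]))
```

In practice, a learned temperature or top-k temporal pooling would typically replace the plain mean, but the mean keeps the sketch minimal.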