2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00717
Learning Temporal Action Proposals With Fewer Labels

Abstract: Temporal action proposals are a common module in action detection pipelines today. Most current methods for training action proposal modules rely on fully supervised approaches that require large amounts of annotated temporal action intervals in long video sequences. The large cost and effort in annotation that this entails motivate us to study the problem of training proposal modules with less supervision. In this work, we propose a semi-supervised learning algorithm specifically designed for training tempora…

Cited by 38 publications (37 citation statements). References 29 publications.
“…Gong et al [176] is a self-supervised method that attained state-of-the-art results on ActivityNet-1.2 among methods with limited supervision, confirming the advantage of self-supervised learning. Recent state-of-the-art weakly supervised methods such as D2-Net [174] achieved performance comparable to the semi-supervised methods of Ji et al [139] and TTC-Loc [175]. This is especially interesting because D2-Net [174] does not use temporal annotations of actions at all, while Ji et al [139] and TTC-Loc [175] use temporal annotations for at least a small percentage of videos in the dataset.…”
Section: Methods With Limited Supervision
confidence: 83%
“…In the semi-supervised setting, a small number of videos are fully annotated with the temporal boundaries of actions and class labels, while a large number of videos are either unlabeled or include only video-level labels. Ji et al [139] employ a fully supervised framework, known as BSN [46], to exploit the small set of labeled data. They encode the input video into a feature sequence and apply sequential perturbations (time warping and time masking [140]) to it.…”
Section: Semi-supervised Action Detection
confidence: 99%
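The sequential perturbations mentioned above (time warping and time masking) can be sketched on a per-video feature sequence. This is a minimal illustration assuming features come as a `(T, D)` NumPy array; the function names, span sizes, and linear-interpolation warp are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def time_mask(features, max_span=8, rng=None):
    """Zero out a random contiguous span of time steps (time masking).

    Assumption: features is a (T, D) array; masked steps are set to 0.
    """
    rng = rng or np.random.default_rng()
    T = features.shape[0]
    span = int(rng.integers(1, max_span + 1))
    start = int(rng.integers(0, max(T - span, 1)))
    out = features.copy()
    out[start:start + span] = 0.0
    return out

def time_warp(features, scale=1.25):
    """Resample the sequence to round(scale * T) steps via linear
    interpolation between neighboring feature vectors (time warping)."""
    T, _ = features.shape
    new_T = max(int(round(T * scale)), 1)
    src = np.linspace(0.0, T - 1, new_T)       # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (src - lo)[:, None]
    return (1.0 - frac) * features[lo] + frac * features[hi]
```

In a consistency-training setup of this kind, the perturbed sequence would be fed through the proposal network and its predictions encouraged to match those on the clean sequence.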
“…We apply our method to generate action proposals. Action proposal generation is an essential part of many methods for action detection, explored by a number of recent papers [8, 10, 15, 19-21, 38]. A popular approach to generating action proposals is to estimate an actionness score for each temporal unit and then apply some form of temporal grouping and non-maximum suppression.…”
Section: Action Proposals
confidence: 99%
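The actionness-based pipeline described above (per-unit scores, threshold grouping, then non-maximum suppression) can be sketched as follows. This is a generic illustration of the approach, not code from any of the cited papers; the threshold values and scoring by mean actionness are assumptions.

```python
def group_proposals(actionness, thr=0.5):
    """Group consecutive temporal units with actionness >= thr into
    (start, end, score) proposals, scored by mean actionness."""
    proposals, start = [], None
    for t, a in enumerate(actionness):
        if a >= thr and start is None:
            start = t                         # open a new segment
        elif a < thr and start is not None:
            seg = actionness[start:t]
            proposals.append((start, t, sum(seg) / len(seg)))
            start = None                      # close the segment
    if start is not None:                     # segment runs to the end
        seg = actionness[start:]
        proposals.append((start, len(actionness), sum(seg) / len(seg)))
    return proposals

def _tiou(a, b):
    """Temporal IoU between two (start, end, score) proposals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def temporal_nms(proposals, iou_thr=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring
    proposals, discarding any with temporal IoU >= iou_thr to a kept one."""
    kept = []
    for p in sorted(proposals, key=lambda q: -q[2]):
        if all(_tiou(p, k) < iou_thr for k in kept):
            kept.append(p)
    return kept
```

Real systems typically refine the grouped boundaries and rank proposals with a learned confidence rather than the raw mean actionness used here.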
“…Only two works have so far explored less supervised alternatives: Ji et al [30] and Khatir et al [31]. With a semi-supervised approach in [30], the authors investigate how the performance of a model is affected when varying the amount of labels used during training. Meanwhile, the method in [31] extracts proposals using an online agglomerative clustering based on distances between consecutive frame features.…”
Section: Temporal Action Proposals
confidence: 99%
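The online clustering idea attributed to [31] above can be sketched as merging consecutive frames into a segment while each new frame stays close to the segment's running mean. This is a hypothetical simplification assuming `(T, D)` frame features and a Euclidean distance threshold; it is not the cited method's actual algorithm.

```python
import numpy as np

def cluster_consecutive(frame_feats, dist_thr=1.0):
    """Merge consecutive frames into segments online: a frame joins the
    current segment if its distance to the segment's running mean is
    below dist_thr; otherwise a new segment starts.

    Returns a list of (start, end) index pairs (end exclusive).
    """
    centroid, start, n = frame_feats[0], 0, 1
    segments = []
    for t in range(1, len(frame_feats)):
        f = frame_feats[t]
        if np.linalg.norm(f - centroid) < dist_thr:
            centroid = (centroid * n + f) / (n + 1)   # online mean update
            n += 1
        else:
            segments.append((start, t))               # close current segment
            centroid, start, n = f, t, 1
    segments.append((start, len(frame_feats)))
    return segments
```

Each resulting segment can then serve as a candidate action proposal without any temporal annotations.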