2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00706
Weakly Supervised Action Localization by Sparse Temporal Pooling Network

Abstract: We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of t…
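The fusion the abstract describes — attention scores used to pool a sparse subset of key segments into one video-level representation — can be sketched in plain numpy. This is an illustrative sketch, not the authors' implementation; the function name, toy features, and scores are invented for the example, and the sigmoid normalization follows the description of STPN quoted later on this page:

```python
import numpy as np

def attentive_temporal_pooling(segment_features, attention_scores):
    """Fuse per-segment features into a single video-level feature,
    weighting each segment by its sigmoid-normalized attention score."""
    weights = 1.0 / (1.0 + np.exp(-attention_scores))  # sigmoid, in [0, 1]
    # Weighted average over the temporal (segment) axis.
    pooled = (weights[:, None] * segment_features).sum(axis=0) / weights.sum()
    return pooled, weights

# Toy example: 4 segments with 3-dimensional features;
# segments 0 and 3 are the "key" segments (high attention logits).
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [1., 1., 1.]])
scores = np.array([5.0, -5.0, -5.0, 5.0])
pooled, w = attentive_temporal_pooling(feats, scores)
# pooled is dominated by the two key segments: roughly [0.99, 0.5, 0.5]
```

The pooled vector is essentially the average of the two high-attention segments, which is the point of the attention module: background segments contribute almost nothing.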


Cited by 366 publications (305 citation statements)
References 43 publications
“…UntrimmedNets [28] employs attention mechanisms to learn the pattern of precut action segments. STPN [29] utilizes a sparsity constraint to detect activities, which improves action localization performance. TSR-Net [30] integrates self-attention and transfer learning with a temporal localization framework to obtain precise temporal intervals in untrimmed videos.…”
Section: B. Action Localization
confidence: 99%
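The sparsity constraint this excerpt refers to is, in STPN, an L1 penalty on the segment attention weights added to the classification loss. A minimal sketch, assuming numpy; the function name and the coefficient `beta` are illustrative, not taken from the paper:

```python
import numpy as np

def sparsity_loss(attention_weights, beta=1e-4):
    """L1 penalty pushing most segment attention weights toward zero,
    so only a sparse subset of key segments drives the video-level prediction."""
    return beta * np.abs(attention_weights).sum()

# Dense attention (spread over all 100 segments) is penalized more
# heavily than sparse attention concentrated on 5 key segments.
w_dense = np.full(100, 0.9)
w_sparse = np.zeros(100)
w_sparse[:5] = 0.9
```

Minimizing this term alongside the classification loss encourages the attention module to commit to a few discriminative segments rather than spreading weight over the whole video.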
“…Existing approaches have investigated different weak supervision strategies for action localization. The works of [25, 14, 28] use action category labels in videos for temporal localization, whereas [13] uses point-level supervision to spatio-temporally localize the actions. [17, 2] exploit the order of actions in a video as a weak supervision cue.…”
Section: Related Work
confidence: 99%
“…Weakly-supervised temporal action localization has been investigated using different types of weak labels, e.g., action categories [25, 28, 14], movie scripts [12, 1], and sparse spatio-temporal points [13]. Recently, Paul et al. [16] proposed an action localization approach, demonstrating state-of-the-art results, using video-level category labels as the weak supervision.…”
Section: Introduction
confidence: 99%
“…Li et al. [19] apply attention for action recognition and action detection in untrimmed sequences, using features from multiple modalities as the input to the temporal attention LSTM before softmax normalisation. Nguyen et al. [25] learn attention for action classification. They normalise the attention scores by a sigmoid function, and then use these to estimate the discriminative class-specific temporal regions for localising actions.…”
Section: Related Work
confidence: 99%