2021
DOI: 10.1109/tpami.2021.3078798
Weakly Supervised Temporal Action Localization through Contrast based Evaluation Networks

Cited by 20 publications (24 citation statements). References 55 publications.
“…5 the pseudo-labeling scheme, without auxiliary losses to regularize the learning process. Notably while some models like untrimmedNets [33] use a different backbone (TSN and TCN), most recent models [18,37,20,22] use the same two-stream I3D feature extraction backbone as our model does, thus are fair comparison from the feature extraction aspect. Compared to the best result among the four recent models [18,37,20,22], we get 3% significant improvement at mAP@0.5.…”
Section: Methods
confidence: 98%
“…Notably while some models like untrimmedNets [33] use a different backbone (TSN and TCN), most recent models [18,37,20,22] use the same two-stream I3D feature extraction backbone as our model does, thus are fair comparison from the feature extraction aspect. Compared to the best result among the four recent models [18,37,20,22], we get 3% significant improvement at mAP@0.5. Our model also shows more significant improvement at higher threshold metrics tIoU=0.6 and tIoU=0.7, which implies our action proposals are more complete.…”
Section: Methods
confidence: 98%
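The statements above compare localization quality at temporal IoU (tIoU) thresholds of 0.5 to 0.7. For reference, a minimal sketch of how temporal IoU between a predicted segment and a ground-truth segment is typically computed; the function name and (start, end) segment representation are illustrative, not taken from the cited papers:

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two segments.

    Each segment is a (start, end) pair in seconds.
    Illustrative sketch, not code from the cited papers.
    """
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Example: a proposal at 2.0-6.0 s against a ground-truth action at 3.0-7.0 s
print(temporal_iou((2.0, 6.0), (3.0, 7.0)))  # 3.0 / 5.0 = 0.6, a hit at tIoU=0.5 and 0.6 but a miss at 0.7
```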
“…We then compute final predictions by applying non-maximum suppression to eliminate overlapping and similar proposals. We compare our approach with an extensive set of leading recent baselines: TSM [35], CMCS [19], MAAN [36], 3C-Net [23], CleanNet [20], BaSNet [14], BM [25], DGAM [28], TSCN [37] and EM-MIL [22]. Details for each baseline can be found in the related work section, and we directly use the results reported by the respective authors.…”
Section: Methods
confidence: 99%
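The statement above mentions applying non-maximum suppression to eliminate overlapping and similar proposals before evaluation. A minimal sketch of greedy temporal NMS, assuming proposals are given as (start, end, score) tuples; the function name and the 0.5 overlap threshold are illustrative choices, not from the cited papers:

```python
def temporal_nms(proposals, iou_threshold=0.5):
    """Greedy temporal NMS over (start, end, score) proposals: keep the
    highest-scoring segment, drop remaining segments whose temporal IoU
    with it exceeds the threshold. Illustrative sketch only.
    """
    def tiou(a, b):
        # Overlap between two segments given by their first two entries (start, end).
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    remaining = sorted(proposals, key=lambda p: p[2], reverse=True)
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining if tiou(best, p) <= iou_threshold]
    return kept


# Example: two overlapping proposals for the same action plus one distinct proposal
print(temporal_nms([(2.0, 6.0, 0.9), (2.5, 6.5, 0.7), (10.0, 12.0, 0.8)]))
# -> [(2.0, 6.0, 0.9), (10.0, 12.0, 0.8)]
```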
“…Nguyen et al [22] introduced a sparsity regularization for video-level classification. Shou et al [26] and Liu [17] investigated score contrast in the temporal dimension. Hideand-Seek [29] randomly removed frame sequences during training to force the network to respond to multiple relevant parts.…”
Section: Related Work
confidence: 99%
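The statement above refers to a sparsity regularization on temporal attention for video-level classification (Nguyen et al. [22]). A minimal sketch of an L1 sparsity term on per-snippet attention weights, in the spirit of that idea; the tensor shapes, function name, and loss weight are assumptions for illustration, not taken from the cited papers:

```python
import torch


def sparsity_loss(attention, weight=1e-4):
    """L1 penalty encouraging sparse temporal attention.

    attention: tensor of shape (batch, num_snippets) with values in [0, 1].
    The loss weight is an illustrative choice, not from the cited papers.
    """
    return weight * attention.abs().mean()


# Example: penalize dense attention over 100 snippets for a batch of 2 videos;
# this term would be added to the video-level classification loss.
attn = torch.sigmoid(torch.randn(2, 100))
total_sparsity_term = sparsity_loss(attn)
```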