Semi-Supervised Action Recognition with Temporal Contrastive Learning

Singh, Ankit; Chakraborty, Omprakash; Varshney, Ashutosh; Panda, Rameswar; Feris, Rogério; Saenko, Kate; Das, Abir

doi:10.48550/arxiv.2102.02751

Cited by 2 publications

(5 citation statements)

References 39 publications

“…In our CMPL framework, this is done by feeding the primary backbone F (•) and the auxiliary network A(•) clips from different temporal locations of the same video while still requiring them to supervise each other. Meanwhile, we also follow [22,40] in regarding different frame rates as a form of temporal augmentation. This is also illustrated in Fig.…”

Section: Methods (mentioning)

confidence: 99%

“…[13] introduces a new framework that leverages a 2D image classifier to assist action recognition. [22] proposes a temporal contrastive learning framework to model temporal aspects by comparing the same video at different speeds.…”

Section: Related Work (mentioning)

confidence: 99%

“…For Fixmatch [23], we adopt the same experimental settings as our approach. Finally, we also include the state-of-the-art video-based semi-supervised learning methods [14,22,38], whose performances from their original papers are directly reported.…”

Section: Settings (mentioning)

confidence: 99%

“…2 suggests that our CMPL obtains more high-quality pseudo-labels than the baseline. In addition, we also study the compatibility of this cross-model framework with conventional temporal data augmentations (i.e., adjusting the temporal location and the frame rate), which are widely used in recent literature [6,22,40]. Experiments on a range of standard benchmarks and training settings demonstrate the effectiveness of our CMPL.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition

Xu, Wei, Sun et al. 2021

Preprint

Semi-supervised action recognition is a challenging but important task due to the high cost of data annotation. A common approach to this problem is to assign unlabeled data with pseudo-labels, which are then used as additional supervision in training. Typically in recent work, the pseudo-labels are obtained by training a model on the labeled data, and then using confident predictions from the model to teach itself. In this work, we propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL). Concretely, we introduce a lightweight auxiliary network in addition to the primary backbone, and ask them to predict pseudo-labels for each other. We observe that, due to their different structural biases, these two models tend to learn complementary representations from the same video clips. Each model can thus benefit from its counterpart by utilizing cross-model predictions as supervision. Experiments on different data partition protocols demonstrate the significant improvement of our framework over existing alternatives. For example, CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and 1% labeled data, outperforming our baseline model, FixMatch [23], by 9.0% and 10.3%, respectively.
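The cross-model idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the weak/strong view split, the confidence threshold of 0.95, and all function names are assumptions borrowed from the general FixMatch-style recipe, used here only to show how each model's confident predictions could supervise the other.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label_loss(teacher_weak_logits, student_strong_logits, threshold):
    # Hypothetical helper: the teacher's confident predictions on a weakly
    # augmented clip become hard labels for the student on a strong view.
    probs = softmax(teacher_weak_logits)
    labels = probs.argmax(axis=1)
    mask = probs.max(axis=1) >= threshold  # keep confident samples only
    log_p = np.log(softmax(student_strong_logits))
    ce = -log_p[np.arange(len(labels)), labels]
    return float((ce * mask).sum() / len(labels))

def cross_model_loss(primary_weak, primary_strong, aux_weak, aux_strong,
                     threshold=0.95):
    # The auxiliary network labels data for the primary backbone, and the
    # primary backbone labels data for the auxiliary network.
    loss_primary = pseudo_label_loss(aux_weak, primary_strong, threshold)
    loss_aux = pseudo_label_loss(primary_weak, aux_strong, threshold)
    return loss_primary, loss_aux
```

In this sketch a sample contributes to a model's loss only when the other model is confident about it, which is one plausible reading of "cross-model predictions as supervision"; the actual loss weighting and augmentations are described in the paper itself.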

“…Recently, Jing et al [27] and Singh et al [47] propose to adapt the SSL framework to the video domain. They focus on algorithmic improvement for video SSL.…”

Section: Data Augmentation (mentioning)

confidence: 99%

Learning Representational Invariances for Data-Efficient Action Recognition

Zou, Choi, Wang et al. 2021

Preprint

Data augmentation is a ubiquitous technique for improving image classification when labeled data is scarce. Constraining the model predictions to be invariant to diverse data augmentations effectively injects the desired representational invariances to the model (e.g., invariance to photometric variations), leading to improved accuracy. Compared to image data, the appearance variations in videos are far more complex due to the additional temporal dimension. Yet, data augmentation methods for videos remain under-explored. In this paper, we investigate various data augmentation strategies that capture different video invariances, including photometric, geometric, temporal, and actor/scene augmentations. When integrated with existing consistency-based semi-supervised learning frameworks, we show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets in the low-label regime. We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.