Temporal Contrastive Pretraining for Video Action Recognition

Lorre, Guillaume; Rabarisoa, Jaonary; Orcesi, Astrid; Aïnouz, Samia; Canu, Stéphane

doi:10.1109/wacv45572.2020.9093278

Cited by 40 publications

(24 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The goal of a future prediction task is to predict high-level information of future time-step given a series of past ones. In [21,22], high-dimensional data are compressed into a compact lower-dimensional latent embedding space. Powerful autoregressive models are used to summarize the information in the latent space, and a context latent representation C t is produced as represented in Figure 7.…”

Section: Future Predictionmentioning

confidence: 99%

See 1 more Smart Citation

A Survey on Contrastive Self-Supervised Learning

et al. 2020

View full text Add to dashboard Cite

Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudolabels as supervision and use the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by different architectures that have been proposed so far. Next, we present a performance comparison of different methods for multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make meaningful progress.

show abstract

Section: Future Predictionmentioning

confidence: 99%

“…Further, a shallow MLP (1 hidden layer) maps representations to a latent space where a contrastive loss is applied. For training a model for action recognition, the most common approach to extract features from a sequence of image frames is to use a 3D-ResNet as encoder [22,24].…”

Section: Encodersmentioning

confidence: 99%

A Survey on Contrastive Self-Supervised Learning

et al. 2020

View full text Add to dashboard Cite

show abstract

“…For unsupervised representation learning, we are inspired by the success of contrastive learning in images (Chen et al 2020b), short-trimmed videos (Lorre et al 2020;Singh et al 2021) and other areas of machine learning (Chen et al 2021;Rahaman, Ghosh, and Thiery 2021). Works which apply contrastive learning to longer sequences bring together multiple 0% 20% 40% 60% 80% 100% Labeled video (%) 20% 40% 60% 80% MoF Our Semi-Supervised Supervised…”

Section: Introductionmentioning

confidence: 99%

“…The few direct extensions of SimCLR for video (Bai et al 2020;Qian et al 2020;Lorre et al 2020) target action recognition on few seconds short clips. Others integrate contrastive learning by bringing together next-frame feature predictions with actual representations (Kong et al 2020;Lorre et al 2020), using path-object tracks for bringing cycleconsistency (Wang, Zhou, and Li 2020), and considering multiple viewpoints (Sermanet et al 2018) or accompanying modalities like audio (Alwassel et al 2019) or text (Miech et al 2020). We are inspired by these works to develop contrastive learning for long-range segmentation.…”

Section: Introductionmentioning

confidence: 99%

Iterative Contrast-Classify For Semi-supervised Temporal Action Segmentation

Singhania¹,

Rahaman²,

Yao³

2021

Preprint

View full text Add to dashboard Cite

Temporal action segmentation classifies the action of each frame in (long) video sequences. Due to the high cost of framewise labeling, we propose the first semi-supervised method for temporal action segmentation. Our method hinges on unsupervised representation learning, which, for temporal action segmentation, poses unique challenges. Actions in untrimmed videos vary in length and have unknown labels and start/end times. Ordering of actions across videos may also vary. We propose a novel way to learn frame-wise representations from temporal convolutional networks (TCNs) by clustering input features with added time-proximity condition and multiresolution similarity. By merging representation learning with conventional supervised learning, we develop an "Iterative-Contrast-Classify (ICC)" semi-supervised learning scheme. With more labelled data, ICC progressively improves in performance; ICC semi-supervised learning, with 40% labelled videos, performs similar to fully-supervised counterparts. Our ICC improves MoF by {+1.8, +5.6, +2.5}% on Breakfast, 50Salads and GTEA respectively for 100% labelled videos.

show abstract

“…Previously, contrastive learning is widely adopted in image representation learning. For image representation learning, multiple methods Inspired by the success of contrastive learning in images, recent methods [18,19,[79][80][81] have been proposed to leverage contrastive learning for video representation learning. For instance, [18]…”

Section: Contrastive Learningmentioning

confidence: 99%

Effective action recognition with fully supervised and self-supervised methods

Cao¹

View full text Add to dashboard Cite

Action recognition in videos has attracted interest in computer vision and machine learning communities thanks to its applications such as surveillance and smart homes. In addition to spatial information in individual frames, videos contain temporal information across the temporal dimension. Therefore, effective spatiotemporal representation is the key to accurate action recognition in videos.Previous works have proposed various fully-supervised and self-supervised methods for video representation learning. For fully-supervised methods, most of them utilize convolution neural networks (CNNs) to extract spatial representation while temporal representation is usually modelled by pixel-wise correlations. However, it is inefficient to extract correlations between all pixels since some of them may relate to in-salient area (e.g. backgrounds or environments). On the other hand, self-supervised methods are proposed to leverage more accessible unlabled data on the Internet and transfer the extracted representation for different downstream tasks. The core of self-supervised methods is to design a pretext task where supervision signal is automatically generated based on characteristics of unlabeled data.Although self-supervised methods avoid the annotation of labeled data, compared to fully-supervised methods, there is room for performance improvement of selfsupervised methods. In this thesis, we address the above research gap with two novel deep learning methods, to advance fully-supervised and self-supervised methods, respectively. For fully-supervised learning, we propose a novel Key Point Shift Embedding Module (KPSEM) to adaptively extract channel-wise key point shifts across video frames without key point annotation for temporal feature extraction. Key points are adaptively extracted as feature points with maximum feature values at split regions, while key point shifts are the spatial displacements of corresponding key points. The key point shifts are encoded as the overall temporal features via linear embedding layers in a multi-set manner. vi To advance self-supervised learning, we propose a novel self-supervised learning method, called Video Incoherence Detection (VID), that leverages incoherence detection for spatio-temporal feature extraction. It roots from the observation that visual systems of human beings can easily identify video incoherence based on their comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed by multiple sub-clips hierarchically sampled from the same raw video with various lengths of incoherence between each other. The network is trained to learn high-level representation by predicting the relative location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between different incoherent clips from the same raw video. Our experiments show that both KPSEM and VID achieve state-of-the-art performance on action recognition wi...

show abstract

Temporal Contrastive Pretraining for Video Action Recognition

Cited by 40 publications

References 18 publications

A Survey on Contrastive Self-Supervised Learning

A Survey on Contrastive Self-Supervised Learning

Iterative Contrast-Classify For Semi-supervised Temporal Action Segmentation

Effective action recognition with fully supervised and self-supervised methods

Contact Info

Product

Resources

About