2022
DOI: 10.1016/j.cviu.2022.103406

TCLR: Temporal contrastive learning for video representation

Cited by 86 publications (42 citation statements)
References 39 publications
“…A discriminator is used to predict high probabilities for similar pairs and low probabilities for dissimilar pairs. To capture both local and global representations, Dave et al [2022] use a local loss that treats non-overlapping clips as negatives and spatially augmented versions of the same clip as positives, and a global-local loss that maximizes the similarity between the global representation of the entire clip and the local representations of the corresponding sub-clips.…”
Section: Spatio-temporal Augmentation
confidence: 99%
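The local-local objective described in the snippets above (non-overlapping clips from the same video as negatives, a spatially augmented version of each clip as its positive) can be sketched as an InfoNCE-style loss. This is a minimal NumPy illustration under stated assumptions: the function name, array shapes, and temperature value are mine, not the authors' code.

```python
import numpy as np

def infonce_local_local(anchors, positives, temperature=0.1):
    """Hypothetical sketch of a local-local temporal contrastive loss.

    anchors:   (N, D) features of N non-overlapping clips from one video
    positives: (N, D) features of spatially augmented versions of the
               same clips, in the same order
    Clip i's positive is positives[i]; every other clip acts as a
    negative, which pushes features apart across the temporal dimension.
    """
    # L2-normalise so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature  # (N, N) similarity matrix
    # softmax cross-entropy with the matching index as the label
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When anchors and positives are correctly aligned, the diagonal of the similarity matrix dominates and the loss is small; shuffling the positives breaks the alignment and raises the loss.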
“…For example, Li et al (2021) proposed JigsawGAN, a GAN-based self-supervised method for solving jigsaw puzzles with unpaired images, to learn semantic and edge information of images. Contrastive learning (Tian et al, 2020; Wang et al, 2021; Dave et al, 2022) can be regarded as a discriminative method that aims to group positive samples and separate negative samples. Dave et al (2022) developed a new temporal contrastive learning framework comprising local–local and local–global temporal contrastive losses to encourage the features to be distinct across the temporal dimension.…”
Section: Related Work
confidence: 99%
“…Contrastive learning (Tian et al, 2020; Wang et al, 2021; Dave et al, 2022) can be regarded as a discriminative method that aims to group positive samples and separate negative samples. Dave et al (2022) developed a new temporal contrastive learning framework comprising local–local and local–global temporal contrastive losses to encourage the features to be distinct across the temporal dimension. Generative model-based approaches usually use generative tasks as pretext tasks to learn features, such as image reconstruction (Fan et al, 2022), image inpainting (Quan et al, 2022), image coloring (Bi et al, 2021), etc.…”
Section: Related Work
confidence: 99%
“…The identity of the animals is discarded and the semantic structure is preserved, as evidenced by the fact that the two red dots are very close to one another. …recognition using 3D pose data [39,42,68] and video-based action understanding [13,50]. However, a barrier to using these tools in neuroscience is that the statistics of our neural data (the locations and sizes of cells) and behavioral data (body-part lengths and limb ranges of motion) can be very different from animal to animal, creating a large domain gap.…”
Section: Related Work
confidence: 99%
“…Specifically, we trained our neural decoder f_n along with the others without using any action labels. Then, freezing the neural encoder parameters, we trained a linear model on the encoded features, an evaluation protocol widely used in the field [10,13,25,39]. We used either half or all of the action labels.…”
Section: Benchmarks
confidence: 99%
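The linear-evaluation protocol mentioned in the snippet above (freeze the encoder, then train only a linear classifier on the encoded features) can be sketched as follows. This is a hypothetical NumPy illustration, not the paper's implementation: the function name, learning rate, and step count are assumptions.

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.5, steps=200):
    """Hypothetical sketch of linear evaluation: the encoder is frozen,
    so `features` are fixed, and only this linear softmax classifier is
    trained on the (action) labels by gradient descent."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n        # softmax cross-entropy gradient
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    preds = np.argmax(features @ W + b, axis=1)
    return (preds == labels).mean()        # probe accuracy
```

Because only the linear layer is trained, the resulting accuracy measures how linearly separable the frozen representation already is, which is why the protocol is a standard proxy for representation quality.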