2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00791
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Cited by 48 publications (19 citation statements)
References 19 publications
“…Aside from self-supervised learning, jointly modeling audio-visual data also improves audio-visual event classification performance [23,17,22,53]. Additionally, due to the general design of transformer architectures [54,55], transformer-based methods [56,57,58,59,60,61] provide a flexible framework for jointly modeling audio and video data. Compared to these prior approaches, which are predominantly focused on audio-visual event classification or self-supervised representation learning, our approach focuses on efficient long-range text-to-video retrieval.…”
Section: Related Work
confidence: 99%
“…In view of the multi-modality of videos, many works explore mutual supervision across modalities to learn representations of each modality. For example, they regard temporal or semantic consistency between video and audio [8,28] or narrations [1,4,35,36] as a natural source of supervision. MIL-NCE [35] introduced contrastive learning to learn joint embeddings between clips and captions of unlabeled and uncurated narrated videos.…”
Section: Related Work
confidence: 99%
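The excerpt above describes contrastive learning between clip and caption embeddings. As a rough illustration only (the function name `info_nce`, the similarity-matrix input, and the temperature value are assumptions, not the cited method's actual formulation), an InfoNCE-style loss over a batch where diagonal pairs are the positives can be sketched as:

```python
import math

def info_nce(sim, temperature=0.07):
    # sim[i][j]: similarity between video clip i and caption j;
    # diagonal entries are the matching (positive) pairs.
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [sim[i][j] / temperature for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the positive
    return loss / n
```

With indistinguishable similarities the loss sits at log(n); it falls toward zero as the diagonal (matched) pairs dominate the off-diagonal ones.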
“…Then, r_{i,j} is calculated from the smoothed δ_{i,j} and r_{i-1,j}, r_{i,j-1}, r_{i-1,j-1} by (6). At backward propagation, µ_{i,j} is calculated by (8). It gains gradient from the three directions in proportion to how optimal the cumulative cost r of each direction is.…”
confidence: 99%
“…Therefore, K-means clustering is a prime candidate. It also remains highly popular among recent methods [75], [76], [77], [71], [78], [79], [80].…”
Section: Clustering Loss Block
confidence: 99%
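Since the quoted passage singles out K-means for the clustering-loss block, a bare-bones Lloyd's-algorithm sketch may help fix ideas (2-D points, squared Euclidean distance, and a fixed iteration count are simplifying assumptions; production code would use a library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm on 2-D points; returns k centroid tuples.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        for c in range(k):
            if clusters[c]:
                centroids[c] = (
                    sum(p[0] for p in clusters[c]) / len(clusters[c]),
                    sum(p[1] for p in clusters[c]) / len(clusters[c]))
    return centroids
```

In the multimodal setting the "points" would be joint embeddings rather than 2-D coordinates, but the assignment/update alternation is the same.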