2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00791
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Cited by 48 publications (19 citation statements)
References 19 publications
“…Aside from self-supervised learning, jointly modeling audio-visual data also improves audio-visual event classification performance [23,17,22,53]. Additionally, due to the general design of transformer architectures [54,55], transformer-based methods [56,57,58,59,60,61] provide a flexible framework for jointly modeling audio and video data. Compared to these prior approaches, which are predominantly focused on audio-visual event classification or self-supervised representation learning, our approach focuses on efficient long-range text-to-video retrieval.…”
Section: Related Work
confidence: 99%
“…In view of the multi-modality of videos, many works explore mutual supervision across modalities to learn representations of each modality. For example, they regard temporal or semantic consistency between video and audio [8,28] or narrations [1,4,35,36] as a natural source of supervision. MIL-NCE [35] introduced contrastive learning to learn joint embeddings between clips and captions of unlabeled and uncurated narrated videos.…”
Section: Related Work
confidence: 99%
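The excerpt above describes contrastive learning between clip and caption embeddings. As a rough illustration only (the function name `info_nce`, the similarity-matrix input, and the temperature value are assumptions, not the cited method's actual formulation), an InfoNCE-style loss over a batch where diagonal pairs are the positives can be sketched as:

```python
import math

def info_nce(sim, temperature=0.07):
    # sim[i][j]: similarity between video clip i and caption j;
    # diagonal entries are the matching (positive) pairs.
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [sim[i][j] / temperature for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the positive
    return loss / n
```

With indistinguishable similarities the loss sits at log(n); it falls toward zero as the diagonal (matched) pairs dominate the off-diagonal ones.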
“…Then, r_{i,j} is calculated from the smoothed δ_{i,j} and r_{i-1,j}, r_{i,j-1}, r_{i-1,j-1} by (6). At backward propagation, µ_{i,j} is calculated by (8). It gains gradient from the three directions in proportion to how optimal the cumulative cost r of each direction is.…”
confidence: 99%
“…Therefore, K-means clustering is a prime candidate. It also remains highly popular among recent methods [75], [76], [77], [71], [78], [79], [80].…”
Section: Clustering Loss Block
confidence: 99%
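Since the quoted passage singles out K-means for the clustering-loss block, a bare-bones Lloyd's-algorithm sketch may help fix ideas (2-D points, squared Euclidean distance, and a fixed iteration count are simplifying assumptions; production code would use a library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm on 2-D points; returns k centroid tuples.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        for c in range(k):
            if clusters[c]:
                centroids[c] = (
                    sum(p[0] for p in clusters[c]) / len(clusters[c]),
                    sum(p[1] for p in clusters[c]) / len(clusters[c]))
    return centroids
```

In the multimodal setting the "points" would be joint embeddings rather than 2-D coordinates, but the assignment/update alternation is the same.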