2021
DOI: 10.48550/arxiv.2112.12182
Preprint
Fine-grained Multi-Modal Self-Supervised Learning

Abstract: Multi-Modal Self-Supervised Learning from videos has been shown to improve a model's performance on various downstream tasks. However, such self-supervised pretraining requires large batch sizes and a large amount of computational resources due to the noise present in uncurated data. This is partly because the prevalent training scheme operates in a coarse-grained setting, in which vectors representing whole video clips or natural-language sentences are used for computing similarity. Such sche…
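For illustration, the following is a minimal PyTorch sketch of the distinction the abstract draws: a coarse-grained objective compares one pooled vector per video clip and per sentence, while a fine-grained objective scores individual frame and word embeddings before pooling. This is an assumption on our part, not the paper's released code; the late-interaction (max-over-words) matching and the InfoNCE loss below are generic choices, not necessarily the authors' exact formulation.

import torch
import torch.nn.functional as F

def coarse_similarity(video_emb, text_emb):
    """Cosine similarity between single pooled vectors per clip / sentence.
    video_emb: (B, D) pooled clip embeddings; text_emb: (B, D) pooled sentence embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return v @ t.T  # (B, B) pairwise clip-sentence similarities

def fine_grained_similarity(video_tokens, text_tokens):
    """Similarity computed from per-frame / per-word embeddings (illustrative choice).
    video_tokens: (B, Tv, D); text_tokens: (B, Tw, D).
    Each frame is matched to its most similar word, then matches are averaged."""
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    # (B, B, Tv, Tw): frame-word similarity for every video-text pair
    sim = torch.einsum("bvd,cwd->bcvw", v, t)
    return sim.max(dim=-1).values.mean(dim=-1)  # (B, B)

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss; matched video-text pairs lie on the diagonal."""
    labels = torch.arange(sim.size(0))
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

if __name__ == "__main__":
    B, Tv, Tw, D = 4, 8, 6, 32
    video_tokens, text_tokens = torch.randn(B, Tv, D), torch.randn(B, Tw, D)
    coarse = coarse_similarity(video_tokens.mean(1), text_tokens.mean(1))
    fine = fine_grained_similarity(video_tokens, text_tokens)
    print(info_nce(coarse).item(), info_nce(fine).item())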

Cited by 1 publication (1 citation statement)
References 23 publications
“…It is suboptimal for generative tasks or dense prediction tasks. In computer vision, some studies perform contrastive learning at pixel-level to learn finer input representations [40]- [42]. In the ST task, we need to conduct fine granularity contrastive at frame-level so that each speech frame in the encoder has precise semantics.…”
Section: Related Work
confidence: 99%