2021
DOI: 10.48550/arxiv.2109.04699
Preprint

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

Jue Wang,
Haofan Wang,
Jincan Deng
et al.

Abstract: While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, the cost of pre-training is high. Second, there is no efficient way to handle the data noise that degrades model performance. Third, previous methods leverage only limited image-text paired data while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose an EfficientCLIP m…

Cited by 3 publications (6 citation statements) | References 31 publications
“…[7] proposes a distillation-based loss to better handle the noise in the dataset. Efficient-CLIP [44] is a concurrent work that uses text-only data to perform a unimodal MLM task. DeCLIP [1] is another concurrent work that uses multiple unimodal SSL losses in addition to removing the noise from negatives with the help of nearest neighbors [11,22].…”
Section: Related Work (mentioning)
confidence: 99%
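
The unimodal text branch described in this statement can be made concrete with a short sketch: a shared text encoder is trained with a CLIP-style contrastive loss on paired data plus an MLM loss on extra text-only data. This is a minimal illustration of the idea as the citation summarizes it, not EfficientCLIP's actual implementation; `image_encoder`, `text_encoder`, `mlm_head`, `lambda_mlm`, and the batch formats are hypothetical stand-ins.

```python
# Illustrative sketch: joint training with a cross-modal contrastive loss
# plus a unimodal masked-language-modeling (MLM) loss on text-only data.
# All names below (image_encoder, text_encoder, mlm_head, lambda_mlm,
# the batch dicts) are hypothetical stand-ins, not EfficientCLIP's API.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(batch_pairs, batch_text_only,
                  image_encoder, text_encoder, mlm_head, lambda_mlm=0.5):
    # Cross-modal branch: standard CLIP-style contrastive objective.
    img_emb = image_encoder(batch_pairs["images"])
    txt_emb = text_encoder(batch_pairs["texts"])
    loss_clip = contrastive_loss(img_emb, txt_emb)

    # Unimodal branch: MLM on text-only data reuses the same text encoder,
    # so single-modal corpora also shape the shared text representation.
    hidden = text_encoder(batch_text_only["masked_texts"],
                          return_hidden_states=True)       # hypothetical flag
    mlm_logits = mlm_head(hidden)                          # (B, T, vocab)
    loss_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        batch_text_only["labels"].view(-1),
        ignore_index=-100,                                 # unmasked positions
    )
    return loss_clip + lambda_mlm * loss_mlm
```

Sharing the text encoder between the two objectives is the point of the design: it is what lets single-modal corpora influence the cross-modal embedding space.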
“…Vision-Language Pretraining: Joint vision-language pretraining (VLP) is an active research area [1,19,39,63,68] where the availability of large-scale image-text datasets, e.g., YFCC100M [71] and Conceptual Captions [9,67], has played a key role in its progress. Although multiple concurrent works are being proposed to further improve VLP models [75], our work differs from them in a few important ways. Specifically, unlike EfficientCLIP [75], which proposes an ensemble approach to obtain a less noisy data subset for cross-modal training, our method attempts to sidestep this problem altogether by re-purposing, as opposed to completely removing, noisy data.…”
Section: Related Work (mentioning)
confidence: 99%
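
One plausible reading of the "ensemble approach to obtain a less noisy data subset" mentioned above is to score every image-text pair with several models and keep only the pairs the ensemble agrees are well matched. The sketch below is an illustrative assumption, not the published algorithm; the `encode_image`/`encode_text` interface, the mean-score rule, and the fixed threshold are hypothetical.

```python
# Hypothetical sketch of ensemble-based filtering of noisy image-text pairs,
# one plausible reading of the "less noisy data subset" idea cited above.
# The model interface, scoring rule, and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_noisy_pairs(pairs, ensemble, keep_threshold=0.5):
    """Keep pairs whose mean image-text cosine similarity across an
    ensemble of scoring models exceeds a threshold; drop the rest as
    likely annotation noise."""
    kept = []
    for image, text in pairs:
        scores = []
        for model in ensemble:
            img_emb = F.normalize(model.encode_image(image), dim=-1)
            txt_emb = F.normalize(model.encode_text(text), dim=-1)
            scores.append((img_emb * txt_emb).sum(-1))     # cosine similarity
        if torch.stack(scores).mean() >= keep_threshold:
            kept.append((image, text))
    return kept
```

The citing paper contrasts this filtering strategy with its own approach of re-purposing, rather than discarding, the low-confidence pairs.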
“…Although multiple concurrent works are being proposed to further improve VLP models [75], our work differs from them in a few important ways. Specifically, unlike EfficientCLIP [75], which proposes an ensemble approach to obtain a less noisy data subset for cross-modal training, our method attempts to sidestep this problem altogether by re-purposing, as opposed to completely removing, noisy data. Similarly, DeCLIP [39] improves on the data-efficiency of CLIP [63] by leveraging intra-modal contrastive learning along with a nearest-neighbor feature bank to augment negatives.…”
Section: Related Work (mentioning)
confidence: 99%
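
The nearest-neighbor feature bank mentioned for DeCLIP can be sketched as a FIFO queue of recent embeddings queried by cosine similarity. This is a minimal sketch of the bank mechanics only, under stated assumptions: the size, random initialization, and argmax retrieval are illustrative choices, and how the retrieved neighbors enter the loss (as extra positives or as a way to clean negatives) varies across the cited methods.

```python
# Minimal sketch of a nearest-neighbor feature bank, one reading of how
# DeCLIP-style training draws neighbors from past embeddings. The queue
# size, random initialization, and argmax retrieval rule are assumptions.
import torch
import torch.nn.functional as F

class FeatureBank:
    """FIFO queue of normalized embeddings from recent training batches."""
    def __init__(self, dim, size=65536):
        # Random placeholder entries until the queue fills with real features.
        self.bank = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def nearest(self, queries):
        """Return the closest stored embedding for each query, by cosine."""
        sims = F.normalize(queries, dim=-1) @ self.bank.t()   # (B, size)
        return self.bank[sims.argmax(dim=-1)]                 # (B, dim)

    @torch.no_grad()
    def enqueue(self, embeddings):
        """Overwrite the oldest slots with the newest batch of embeddings."""
        embeddings = F.normalize(embeddings, dim=-1)
        n = embeddings.size(0)
        idx = (self.ptr + torch.arange(n)) % self.bank.size(0)
        self.bank[idx] = embeddings
        self.ptr = (self.ptr + n) % self.bank.size(0)
```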