Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhang, Lin; Yu, Sherry H.; Kuang, Zhiyi; Pathak, Deepak; Ramanan, Deva

doi:10.48550/arxiv.2301.06267

Cited by 2 publications

(1 citation statement)

References 66 publications


“…Our proposed Treff adapter outperforms the TIP-adapter by 0.71 percentage points in terms of accuracy, indicating that CALM learns task-specific knowledge while preserving the knowledge from CLAP. Table 3 compares the proposed Treff-adapter with other cross-modality few-shot methods on the ImageNet-ESC [12]. It can be observed that the Treff adapter and the TIP-adapter outperform the cross-modality few-shot learning by a large margin as they are able to make use of zero-shot knowledge transferring explicitly while the cross-modality FSL discards it gradually in the parameter optimisation.…”

Section: Model ESC-50 FSDKaggle18k (mentioning)

confidence: 99%

Interspeech 2023

2023


Contrastive language-audio pretraining (CLAP) has become a new paradigm to learn audio concepts with audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to finetune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the amount of annotations is limited compared to the model size. In this work, we introduce a Training-efficient (Treff) adapter to rapidly learn with a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) to map a set of labelled examples and test audio to test labels. Second, we find initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields competitive results to fully-supervised methods.
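
The abstract above describes mapping a set of labelled examples and a test audio clip to test labels, with a cosine measurement as the training-free starting point. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes the test embedding acts as the attention query, the support embeddings as keys, and one-hot labels as values, with no learned projections. The function names, the `temperature` parameter, and the toy 2-D embeddings are all hypothetical, and the zero-shot text branch of CLAP is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_attention_predict(test_emb, support_embs, support_labels,
                             num_classes, temperature=0.1):
    """Training-free attention over the support set (illustrative assumption).

    Query = test embedding, keys = support embeddings, values = one-hot labels.
    `temperature` is a hypothetical sharpening parameter, not taken from the paper.
    """
    query = l2_normalize(test_emb)
    keys = [l2_normalize(s) for s in support_embs]

    # Cosine similarity between the test clip and every labelled example.
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over the support examples.
    peak = max(sims)
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of one-hot labels gives one score per class.
    scores = [0.0] * num_classes
    for weight, label in zip(weights, support_labels):
        scores[label] += weight
    return scores


# Toy example: two classes, two labelled embeddings each (made-up values).
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
scores = cosine_attention_predict([0.8, 0.2], support, labels, num_classes=2)
predicted = scores.index(max(scores))
```

Because the attention weights sum to one, the class scores form a distribution over labels. In a trainable variant, learned linear projections would presumably replace the identity mappings used here.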

