ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747631
AudioCLIP: Extending CLIP to Image, Text and Audio

Cited by 121 publications (58 citation statements). References 16 publications.
“…Also, the style prompt tuning positively contributes to ESPER's performance, increasing CIDEr by 4.86. Wav2CLIP (and preliminary experiments with other audio encoders, specifically, Guzhov et al (2022); Wu et al (2022), which are also pretrained on an audio classification dataset (Gemmeke et al, 2017; Chen et al, 2020a)) appears to provide a less accurate training signal for ESPER compared to image CLIP pretrained on a large-scale image caption dataset (Radford et al, 2021). We expect this is the case not only because audio classification datasets are relatively small (Zhao et al, 2021) but also because these datasets do not offer rich natural language annotations.…”
Section: Evaluation of Auditory Alignment
confidence: 99%
“…CLIP [28] learned the relationship between image and text embeddings by multimodal self-supervised learning on 400 million image-text pairs and showed zero-shot inference performance comparable to supervised learning on most image-text benchmark datasets. Recent studies [11,24,44] extend the modalities of CLIP to audio. Lee et al [24] focused especially on audio-visual representation learning for image editing, and we also leverage that audio-visual multimodal embedding space for navigating the latent code.…”
Section: Initial Latent
confidence: 99%
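The statement above describes CLIP-style zero-shot inference in a shared embedding space and its extension to audio. As a rough illustration, the sketch below shows how such a space can be queried for zero-shot audio classification by comparing an audio embedding against text-prompt embeddings; the `audio_encoder`/`text_encoder` interfaces and the prompt strings are assumptions for illustration, not the actual AudioCLIP API.

```python
# Minimal sketch of zero-shot classification in a shared CLIP-style embedding
# space extended to audio. The encoder modules and their call signatures are
# illustrative placeholders, not the real AudioCLIP interface.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_audio_classification(audio_encoder, text_encoder, audio, class_prompts):
    # Embed the audio clip and the textual class prompts into the shared space.
    a = F.normalize(audio_encoder(audio), dim=-1)         # (1, dim)
    t = F.normalize(text_encoder(class_prompts), dim=-1)  # (num_classes, dim)
    # Cosine similarity between the audio embedding and each prompt embedding,
    # turned into a distribution over the candidate classes.
    logits = a @ t.T                                       # (1, num_classes)
    return logits.softmax(dim=-1)

# Hypothetical usage:
# probs = zero_shot_audio_classification(
#     audio_encoder, text_encoder, waveform,
#     ["a dog barking", "rain falling", "a siren wailing"])
```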
“…Over time, the focus shifted to a different form of using contrastive loss to learn more informative multimodal embedding spaces. There have been several variations of contrastive loss [10,11,12]. In this work, we use the contrastive objective from CLIP [10,11].…”
Section: Contrastive Loss
confidence: 99%
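The quoted passage refers to the contrastive objective from CLIP. Below is a minimal sketch of that symmetric contrastive (InfoNCE-style) loss for a batch of paired embeddings from two modalities; the function name and the learnable temperature passed in as `logit_scale` are illustrative assumptions, not taken from the cited implementations.

```python
# Sketch of the symmetric contrastive objective popularized by CLIP
# (Radford et al, 2021), written for paired embeddings of two modalities
# (e.g. image/text or audio/text). Names are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a, emb_b, logit_scale):
    """emb_a, emb_b: (batch, dim) embeddings of matched pairs;
    logit_scale: learnable scalar temperature (as a tensor)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Pairwise cosine similarities scaled by the temperature.
    logits = logit_scale.exp() * emb_a @ emb_b.T          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each row and each column should
    # assign the highest score to its diagonal (matched) partner.
    loss_a = F.cross_entropy(logits, targets)
    loss_b = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_a + loss_b)
```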