ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747631
AudioCLIP: Extending CLIP to Image, Text and Audio

Cited by 121 publications (58 citation statements). References 16 publications.
“…Also, the style prompt tuning positively contributes to ESPER's performance, increasing CIDEr by 4.86. Wav2CLIP (and preliminary experiments with other audio encoders, specifically, Guzhov et al (2022); Wu et al (2022), which are also pretrained on an audio classification dataset (Gemmeke et al, 2017; Chen et al, 2020a)) appears to provide a less accurate training signal for ESPER compared to image CLIP pretrained on a large-scale image caption dataset (Radford et al, 2021). We expect this is the case not only because audio classification datasets are relatively small (Zhao et al, 2021) but also because these datasets do not offer rich natural language annotations.…”
Section: Evaluation of Auditory Alignment
confidence: 99%
“…CLIP [28] learned the relationship between image and text embeddings by multimodal self-supervised learning on 400 million image-text pairs and showed zero-shot inference performance comparable to supervised learning on most image-text benchmark datasets. Recent studies [11,24,44] extend the modalities of CLIP to audio. Lee et al [24] focused especially on audio-visual representation learning for image editing, and we also leverage that audio-visual multimodal embedding space for navigating the latent code.…”
Section: Initial Latent
confidence: 99%
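The statement above describes CLIP-style zero-shot inference in a shared embedding space and its extension to audio. As a rough illustration, the sketch below shows how such a space can be queried for zero-shot audio classification by comparing an audio embedding against text-prompt embeddings; the `audio_encoder`/`text_encoder` interfaces and the prompt strings are assumptions for illustration, not the actual AudioCLIP API.

```python
# Minimal sketch of zero-shot classification in a shared CLIP-style embedding
# space extended to audio. The encoder modules and their call signatures are
# illustrative placeholders, not the real AudioCLIP interface.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_audio_classification(audio_encoder, text_encoder, audio, class_prompts):
    # Embed the audio clip and the textual class prompts into the shared space.
    a = F.normalize(audio_encoder(audio), dim=-1)         # (1, dim)
    t = F.normalize(text_encoder(class_prompts), dim=-1)  # (num_classes, dim)
    # Cosine similarity between the audio embedding and each prompt embedding,
    # turned into a distribution over the candidate classes.
    logits = a @ t.T                                       # (1, num_classes)
    return logits.softmax(dim=-1)

# Hypothetical usage:
# probs = zero_shot_audio_classification(
#     audio_encoder, text_encoder, waveform,
#     ["a dog barking", "rain falling", "a siren wailing"])
```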
“…Over time, the focus shifted to a different form of using contrastive loss to learn more informative multimodal embedding spaces. There have been several variations of contrastive loss [10,11,12]. In this work, we use the contrastive objective from CLIP [10,11].…”
Section: Contrastive Loss
confidence: 99%
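The quoted passage refers to the contrastive objective from CLIP. Below is a minimal sketch of that symmetric contrastive (InfoNCE-style) loss for a batch of paired embeddings from two modalities; the function name and the learnable temperature passed in as `logit_scale` are illustrative assumptions, not taken from the cited implementations.

```python
# Sketch of the symmetric contrastive objective popularized by CLIP
# (Radford et al, 2021), written for paired embeddings of two modalities
# (e.g. image/text or audio/text). Names are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a, emb_b, logit_scale):
    """emb_a, emb_b: (batch, dim) embeddings of matched pairs;
    logit_scale: learnable scalar temperature (as a tensor)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Pairwise cosine similarities scaled by the temperature.
    logits = logit_scale.exp() * emb_a @ emb_b.T          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each row and each column should
    # assign the highest score to its diagonal (matched) partner.
    loss_a = F.cross_entropy(logits, targets)
    loss_b = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_a + loss_b)
```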