2021
DOI: 10.48550/arxiv.2112.00007
Preprint

Sound-Guided Semantic Image Manipulation

Abstract: [Abstract text not captured. Teaser figure: an input image and the mel-spectrogram / acoustic features of a sound (e.g. fire crackling, underwater bubbling, siren, wind noise, giggling, sobbing, nose blowing) are combined to produce a manipulated image.]

Cited by 1 publication (7 citation statements)
References 37 publications
“…CLIP-based Sound Representation Learning. We use the VGG-Sound [4] dataset to create Lee's [24] audio-visual embedding space. VGG-Sound is a large-scale audio-visual dataset including more than 310 classes with over 200,000 video clips.…”
Section: Methods (mentioning)
confidence: 99%
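The quoted passage describes building an audio-visual embedding space by aligning an audio encoder with CLIP-style image embeddings over VGG-Sound pairs. Below is a minimal PyTorch sketch of that idea; the `AudioEncoder` architecture, embedding size, temperature, and the random stand-in batch are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Hypothetical encoder: mel-spectrogram -> CLIP-sized embedding (512-d)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):              # mel: (B, 1, n_mels, time)
        h = self.conv(mel).flatten(1)    # (B, 64)
        return self.proj(h)              # (B, embed_dim)

def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matching audio/image pairs together."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature      # (B, B) pairwise similarity
    targets = torch.arange(len(a), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy training step: in practice image_emb would come from a frozen CLIP
# image encoder applied to VGG-Sound video frames; random here for illustration.
encoder = AudioEncoder()
mel = torch.randn(8, 1, 128, 256)        # batch of mel-spectrograms
image_emb = torch.randn(8, 512)          # stand-in for CLIP image embeddings
loss = contrastive_loss(encoder(mel), image_emb)
loss.backward()
```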
“…CLIP [28] learned the relationship between image and text embeddings by multimodal self-supervised learning on 400 million image-text pairs and showed zero-shot inference performance comparable to supervised learning on most image-text benchmark datasets. Recent studies [11,24,44] extend the modalities of CLIP to audio. Lee et al. [24] especially focused on audio-visual representation learning for image editing, and we also leverage that audio-visual multimodal embedding space for navigating the latent code.…”
Section: Initial Latent (mentioning)
confidence: 99%
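This second statement describes steering a generator's latent code with the audio-visual embedding: starting from an initial latent, the code is optimized so the CLIP embedding of the generated image moves toward the sound's embedding. A minimal sketch of such a loop follows; `generator`, `clip_image_encoder`, the regularizer, and the toy stand-in modules are placeholder assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def navigate_latent(generator, clip_image_encoder, audio_emb,
                    w_init, steps=200, lr=0.01, lambda_reg=0.1):
    """Optimize a latent code so the generated image's CLIP embedding
    aligns (by cosine similarity) with a target audio embedding."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    audio_emb = F.normalize(audio_emb, dim=-1)
    for _ in range(steps):
        img = generator(w)                          # (B, 3, H, W)
        img_emb = F.normalize(clip_image_encoder(img), dim=-1)
        sim = (img_emb * audio_emb).sum(dim=-1)     # cosine similarity
        # Pull toward the sound while staying close to the initial latent.
        loss = (1 - sim).mean() + lambda_reg * (w - w_init).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

# Toy usage with stand-in modules (a real setup would use a pretrained
# StyleGAN generator and CLIP's image encoder).
G = torch.nn.Linear(512, 3 * 32 * 32)
generator = lambda w: G(w).view(-1, 3, 32, 32)
clip_image_encoder = lambda img: img.flatten(1)[:, :512]
w_edit = navigate_latent(generator, clip_image_encoder,
                         audio_emb=torch.randn(1, 512),
                         w_init=torch.randn(1, 512))
```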