2021
DOI: 10.48550/arxiv.2112.00007
Preprint

Sound-Guided Semantic Image Manipulation

Abstract: [Abstract text not captured. Teaser figure: an input image and the mel-spectrogram / acoustic features of a sound (e.g. fire crackling, underwater bubbling, siren, wind noise, giggling, sobbing, nose blowing) are combined to produce a manipulated image.]

Cited by 1 publication (7 citation statements)
References 37 publications
“…CLIP-based Sound Representation Learning. We use the VGG-Sound [4] dataset to create Lee's [24] audio-visual embedding space. VGG-Sound is a large-scale audio-visual dataset including more than 310 classes with over 200,000 video clips.…”
Section: Methods (mentioning)
confidence: 99%
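The quoted passage describes building an audio-visual embedding space by aligning an audio encoder with CLIP-style image embeddings over VGG-Sound pairs. Below is a minimal PyTorch sketch of that idea; the `AudioEncoder` architecture, embedding size, temperature, and the random stand-in batch are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Hypothetical encoder: mel-spectrogram -> CLIP-sized embedding (512-d)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):              # mel: (B, 1, n_mels, time)
        h = self.conv(mel).flatten(1)    # (B, 64)
        return self.proj(h)              # (B, embed_dim)

def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matching audio/image pairs together."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature      # (B, B) pairwise similarity
    targets = torch.arange(len(a), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy training step: in practice image_emb would come from a frozen CLIP
# image encoder applied to VGG-Sound video frames; random here for illustration.
encoder = AudioEncoder()
mel = torch.randn(8, 1, 128, 256)        # batch of mel-spectrograms
image_emb = torch.randn(8, 512)          # stand-in for CLIP image embeddings
loss = contrastive_loss(encoder(mel), image_emb)
loss.backward()
```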
“…CLIP [28] learned the relationship between image and text embeddings by multimodal self-supervised learning on 400 million image-text pairs and showed zero-shot inference performance comparable to supervised learning on most image-text benchmark datasets. Recent studies [11,24,44] extend the modalities of CLIP to audio. Lee et al. [24] especially focused on audio-visual representation learning for image editing, and we also leverage that audio-visual multimodal embedding space for navigating the latent code.…”
Section: Initial Latent (mentioning)
confidence: 99%
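This second statement describes steering a generator's latent code with the audio-visual embedding: starting from an initial latent, the code is optimized so the CLIP embedding of the generated image moves toward the sound's embedding. A minimal sketch of such a loop follows; `generator`, `clip_image_encoder`, the regularizer, and the toy stand-in modules are placeholder assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def navigate_latent(generator, clip_image_encoder, audio_emb,
                    w_init, steps=200, lr=0.01, lambda_reg=0.1):
    """Optimize a latent code so the generated image's CLIP embedding
    aligns (by cosine similarity) with a target audio embedding."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    audio_emb = F.normalize(audio_emb, dim=-1)
    for _ in range(steps):
        img = generator(w)                          # (B, 3, H, W)
        img_emb = F.normalize(clip_image_encoder(img), dim=-1)
        sim = (img_emb * audio_emb).sum(dim=-1)     # cosine similarity
        # Pull toward the sound while staying close to the initial latent.
        loss = (1 - sim).mean() + lambda_reg * (w - w_init).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

# Toy usage with stand-in modules (a real setup would use a pretrained
# StyleGAN generator and CLIP's image encoder).
G = torch.nn.Linear(512, 3 * 32 * 32)
generator = lambda w: G(w).view(-1, 3, 32, 32)
clip_image_encoder = lambda img: img.flatten(1)[:, :512]
w_edit = navigate_latent(generator, clip_image_encoder,
                         audio_emb=torch.randn(1, 512),
                         w_init=torch.randn(1, 512))
```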