2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00337
Sound-Guided Semantic Image Manipulation

Cited by 22 publications (17 citation statements)
References 26 publications
“…Huang, Patrick, et al, 2021), sketch-based image retrieval (Jing et al, 2022), code search (D. Guo, Lu, Duan, et al, 2022), visual question answering (Z. Chen, Chen, et al, 2021), event detection (Elhoseiny et al, 2016; S. Wu et al, 2014), visual grounding (Tziafas & Kasaei, 2021), natural language grounding (Sinha et al, 2019), semantic image manipulation (S. H. Lee et al, 2022), medical image segmentation (Bian et al, 2022), video object segmentation (Zhao et al, 2021), sign language recognition (Madapana, 2020), tactile object recognition (H. Liu et al, 2018), and driver behavior recognition (Reiß et al, 2020). More cases can be found in this survey (Cao et al, 2020).…”
Section: Benchmark Datasets
confidence: 99%
“…Instead of an introduction that concentrates on the applications themselves, several datasets available in various scenarios are offered to readers as guidelines, such as cross‐modal classification and retrieval (Geigle et al, 2022; Mercea et al, 2022; Parida et al, 2020; Shvetsova et al, 2022; Wray et al, 2019), cross‐lingual retrieval (P.‐Y. Huang, Patrick, et al, 2021), sketch‐based image retrieval (Jing et al, 2022), code search (D. Guo, Lu, Duan, et al, 2022), visual question answering (Z. Chen, Chen, et al, 2021), event detection (Elhoseiny et al, 2016; S. Wu et al, 2014), visual grounding (Tziafas & Kasaei, 2021), natural language grounding (Sinha et al, 2019), semantic image manipulation (S. H. Lee et al, 2022), medical image segmentation (Bian et al, 2022), video object segmentation (Zhao et al, 2021), sign language recognition (Madapana, 2020), tactile object recognition (H. Liu et al, 2018), and driver behavior recognition (Reiß et al, 2020). More cases can be found in this survey (Cao et al, 2020).…”
Section: Model Evaluation Metrics and Datasets for MZSL
confidence: 99%
“…Recently, conditional information in other modalities, such as text [198]–[202] and speech [203]–[205], has attracted increasing research attention due to the development of pre-trained large-scale frameworks (e.g., CLIP [206]) and the availability of related datasets (e.g., CelebA-Dialog [207]). Moreover, novel modalities of supervision signal, such as biometrics (e.g., brain responses recorded via electroencephalography [208]) and sound [209], have also been utilized to learn feature representations for semantic editing.…”
Section: Challenges and Future Directions
confidence: 99%