Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3547881

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

Abstract: Text-to-image generation aims at generating realistic images that are semantically consistent with the given text. Previous works mainly adopt the multi-stage architecture by stacking generator-discriminator pairs to engage multiple adversarial training, where the text semantics used to provide generation guidance remain static across all stages. This work argues that text features at each stage should be adaptively re-composed conditioned on the status of the historical stage (i.e., historical stage's text an…
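To make the abstract's idea concrete, the following is a minimal sketch of stage-wise text re-composition, assuming a PyTorch-style implementation; the module names, tensor shapes, and the attention-based re-composition mechanism are illustrative assumptions, not the authors' code. The point is only that each generator stage receives a text code conditioned on the previous stage's features rather than one static sentence embedding.

# Minimal sketch: word features are re-composed at every generation stage,
# conditioned on the previous stage's hidden features. All names and shapes
# are assumptions made for illustration.
import torch
import torch.nn as nn

class TextRecomposer(nn.Module):
    """Re-weights word embeddings with attention driven by the previous stage's features."""
    def __init__(self, word_dim, img_dim):
        super().__init__()
        self.query = nn.Linear(img_dim, word_dim)  # previous-stage features -> attention query
        self.scale = word_dim ** -0.5

    def forward(self, words, prev_feat):
        # words: (B, T, word_dim); prev_feat: (B, img_dim)
        q = self.query(prev_feat).unsqueeze(1)                                  # (B, 1, word_dim)
        attn = torch.softmax((q @ words.transpose(1, 2)) * self.scale, dim=-1)  # (B, 1, T)
        return (attn @ words).squeeze(1)                                        # stage-specific text code

class StageGenerator(nn.Module):
    """One generator stage: fuses the current hidden state with the evolved text code."""
    def __init__(self, hidden_dim, word_dim):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(hidden_dim + word_dim, hidden_dim), nn.ReLU())

    def forward(self, hidden, text_code):
        return self.fuse(torch.cat([hidden, text_code], dim=-1))

# Toy forward pass over three stages (batch of 2 captions, 16 words each).
B, T, word_dim, hidden_dim = 2, 16, 64, 128
words = torch.randn(B, T, word_dim)
hidden = torch.randn(B, hidden_dim)
recomposer = TextRecomposer(word_dim, img_dim=hidden_dim)
stages = nn.ModuleList(StageGenerator(hidden_dim, word_dim) for _ in range(3))
for stage in stages:
    text_code = recomposer(words, hidden)  # text semantics evolve with the stage state
    hidden = stage(hidden, text_code)      # evolved semantics guide the next stage

In this sketch the guidance signal changes from stage to stage, which is the contrast the abstract draws with prior multi-stage models that reuse the same static text features throughout.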

Cited by 10 publications (2 citation statements)
References 43 publications (97 reference statements)
“…Early works mainly adopted GAN (Goodfellow et al 2014) as the foundational generative model for this task. Various works have been proposed (Zhang et al 2021; Zhu et al 2019; Xu et al 2018; Zhang et al 2017; Zhang, Xie, and Yang 2018; Liang, Pei, and Lu 2020; Cheng et al 2020; Ruan et al 2021; Tao et al 2020; Li et al 2019; Huang et al 2022) with well-designed textual representations, elegant text-image interactions, and effective loss functions. However, GAN-based models often suffer from training instability and mode collapse, making them hard to train on large-scale datasets (Brock, Donahue, and Simonyan 2018; Kang et al 2023; Schuhmann et al 2021).…”
Section: Related Work, Text-to-Image Generation
confidence: 99%
“…Then, the top-π‘˜ measures, such as P@π‘˜, mAP@π‘˜, and NDCG@π‘˜, will not drop significantly while the specified label has been ranked far behind the top-π‘˜ position. In the extreme case, the classifier can never detect certain important categories and produce degraded classification performance, which may result in serious consequences for multimedia applications, such as multimodal emotion recognition [35] and text-to-image generation [20].…”
Section: Introduction
confidence: 99%
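A small, self-contained Python example of the ranking behaviour the last statement describes: precision@k stays high even when a relevant but rare label is ranked far outside the top-k cut-off. The labels and ranking below are made up purely for illustration.

# Sketch: P@k is insensitive to a relevant label ranked far behind position k.
def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k ranked labels that are relevant."""
    top_k = ranked_labels[:k]
    return sum(1 for label in top_k if label in relevant) / k

# Ten candidate labels ranked by a hypothetical classifier; "rare_class" is
# relevant but ranked last.
ranked = ["cat", "dog", "tree", "car", "sky", "road", "house", "bird", "boat", "rare_class"]
relevant = {"cat", "dog", "rare_class"}

print(precision_at_k(ranked, relevant, k=3))  # 0.67: missing rare_class barely affects P@3

Even though the classifier never surfaces rare_class, P@3 remains 2/3, which is the failure mode the cited statement warns about for top-k evaluation.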