Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Ayan, Burcu Karagol; Mahdavi, Sara; Lopes, Rapha Gontijo; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad

doi:10.48550/arxiv.2205.11487

Cited by 191 publications

(316 citation statements)

References 35 publications

Supporting

Mentioning

312

Contrasting


“…Conditional image generation is a task to synthesize an intended image provided that conditional information is given. The conditional information could be an image [11], [13], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], a categorical label [10], [27], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], or a text [85], [86], [87], [88], [89], [90]. With GAN's surging popularity, conditional GAN (cGAN) has become a representative framework for conditional image generation, and the cGANs specialized in each conditional information are being studied as separate research topics.…”

Section: Conditional Image Generation

mentioning

confidence: 99%

StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis

Kang, Shin, Park

2022

Preprint


Generative Adversarial Network (GAN) is one of the state-of-the-art generative models for realistic image synthesis. While training and evaluating GAN becomes increasingly important, the current GAN research ecosystem does not provide reliable benchmarks for which the evaluation is conducted consistently and fairly. Furthermore, because there are few validated GAN implementations, researchers devote considerable time to reproducing baselines. We study the taxonomy of GAN approaches and present a new open-source library named StudioGAN. StudioGAN supports 7 GAN architectures, 9 conditioning methods, 4 adversarial losses, 13 regularization modules, 3 differentiable augmentations, 7 evaluation metrics, and 5 evaluation backbones. With our training and evaluation protocol, we present a large-scale benchmark using various datasets (CIFAR10, ImageNet, AFHQv2, FFHQ, and Baby/Papa/Granpa-ImageNet) and 3 different evaluation backbones (InceptionV3, SwAV, and Swin Transformer). Unlike other benchmarks used in the GAN community, we train representative GANs, including BigGAN, StyleGAN2, and StyleGAN3, in a unified training pipeline and quantify generation performance with 7 evaluation metrics. The benchmark evaluates other cutting-edge generative models (e.g., StyleGAN-XL, ADM, MaskGIT, and RQ-Transformer). StudioGAN provides GAN implementations, training, and evaluation scripts with the pre-trained weights. StudioGAN is available at https://github.com/POSTECH-CVLab/PyTorch-StudioGAN.

“…Fine-Grained Stylization with ArtBench. Many powerful models emulate text-driven stylization by adding the postfix "... in the style of ..." to a given prompt [22,19,21,23,30]. By using style specific databases obtained from the ArtBench dataset [16] during inference, we here present an alternative approach.…”

Section: Recap On Retrieval-augmented Diffusion Models

mentioning

confidence: 99%

“…Diffusion models have recently set the state of the art in image generation and controllable synthesis [9,22]. In text-to-image synthesis in particular, we have seen impressive results [21,23] that can also be used to create artistic images. Such models thus have the potential to help artists create new content and have contributed to the tremendous growth of the field of AI generated art [7].…”

mentioning

confidence: 99%

Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models

Rombach, Blattmann, Ommer

2022

Preprint


Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Of particular note is the field of "AI-Art", which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining speech and image synthesis models, so-called "prompt-engineering" has become established, in which carefully selected and composed sentences are used to achieve a certain visual style in the synthesized image. In this note, we present an alternative approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains, for example, only images of a particular visual style. This provides a novel way to "prompt" a general trained model after training and thereby specify a particular visual style. As shown by our experiments, this approach is superior to specifying the visual style within the text prompt. We open-source code and model weights at https://github.com/CompVis/latent-diffusion.
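
The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration, and the diffusion model itself is omitted. The sketch only shows the idea of retrieving nearest neighbors from one database during training and from a swapped, style-specific database during inference.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_neighbors(
    query: Vector, database: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; a real system would use a learned image/text encoder.
general_db = {
    "photo_cat": [0.9, 0.1, 0.0],
    "photo_dog": [0.8, 0.2, 0.1],
    "sketch_cat": [0.5, 0.5, 0.2],
}
style_db = {
    "impressionist_cat": [0.7, 0.1, 0.6],
    "impressionist_dog": [0.6, 0.3, 0.5],
    "impressionist_field": [0.1, 0.3, 0.9],
}

query = [1.0, 0.0, 0.1]  # hypothetical embedding of a training instance or prompt

# Training: neighbors come from the general database.
training_neighbors = retrieve_neighbors(query, general_db, k=2)

# Inference: the same retrieval call, but against a style-specific database.
inference_neighbors = retrieve_neighbors(query, style_db, k=2)

print([name for name, _ in training_neighbors])
print([name for name, _ in inference_neighbors])
```

In this sketch the retrieval function is unchanged between the two phases; only the database passed to it differs, which mirrors how the abstract describes specifying a visual style after training.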


“…While there have been various standardized image synthesis benchmarks such as CIFAR-10 [26], ImageNet [8], and LSUN [57], they are biased towards photos of real scenes or objects rather than artwork. Recent works like DALLE 2 [39], GLIDE [37], and Imagen [42] datasets have demonstrated the potential of generative models to generate artworks. However, those models are trained on a mixture of photos and artworks of various styles, making it difficult to conduct analysis and evaluations on artwork generation of particular artistic styles.…”

Section: Mark Rothko

mentioning

confidence: 99%

The ArtBench Dataset: Benchmarking Generative Models with Artworks

Liao, Li, Liu et al.

2022

Preprint


We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from the long tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32 × 32, 256 × 256, and original image size), formatted in a way that is easy to be incorporated by popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis. The dataset is available at https://github.com/liaopeiyuan/artbench under a Fair Use license.
