“…Recently, denoising diffusion models [17] have been proposed for image synthesis, which can generate high-quality images by iteratively denoising a noise image toward a clean image [10,36,37,39]. Diffusion models can also be conditioned on various inputs such as depth-map [48], sketch [43], semantic segmentation [1,37], text [30,36,37,39] and others [18,22,48]. For text conditioning, these models take advantage of pre-trained large language models (LLMs) [34,35] in order to generate images that are aligned with a natural language text prompt given by the user.…”