With the introduction of mapping networks in StyleGAN [14, 15, 16], the input random noise can first be mapped to an intermediate latent space with disentangled semantics, allowing the model to generate higher-quality images. Several works [1, 17, 33] have further shown that exploring the latent space of StyleGAN is useful for text-driven image synthesis and manipulation, where the pretrained vision-language model CLIP [37] is used to manipulate pretrained unconditional StyleGAN generators. To relieve the need for paired text data during the training phase, Lafite [55] proposes to adopt image CLIP embeddings as the conditioning input during training while substituting text CLIP embeddings at inference.
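Lafite's language-free training exploits the fact that CLIP embeds images and text into an approximately shared space: the generator is conditioned on image embeddings during training, and text embeddings can be swapped into the same conditioning slot at inference. The sketch below illustrates this idea only; it assumes the OpenAI `clip` package, and the generator `G` and the function names are hypothetical placeholders, not Lafite's actual implementation.

```python
import torch
import clip

# Load a pretrained CLIP model (ViT-B/32 is an assumption; the exact
# backbone used by Lafite may differ).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def train_condition(image):
    # Training phase: condition on the CLIP *image* embedding,
    # so no paired captions are required.
    with torch.no_grad():
        emb = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    # CLIP embeddings are typically L2-normalized before use.
    return emb / emb.norm(dim=-1, keepdim=True)

def infer_condition(prompt):
    # Inference phase: swap in the CLIP *text* embedding for the same
    # conditioning slot, relying on the shared image-text embedding space.
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

# Hypothetical usage with a conditional generator G(z, c):
# z = torch.randn(1, 512, device=device)
# fake   = G(z, train_condition(real_image))                 # training
# sample = G(z, infer_condition("a photo of a red bird"))    # inference
```

The design choice worth noting is that the training and inference paths produce conditioning vectors of the same dimensionality and normalization, which is what makes the train-on-images, infer-from-text substitution possible.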