“…Language and vision already interact in simple tasks such as object classification, where images are mapped to concepts in a closed vocabulary of categories. However, multimodal representations [4] allow for richer interactions, enabling cross-modal tasks such as cross-modal retrieval [11,63,9,66,13], image captioning [18,12,49], visual question answering [47,23,10,65], and, more recently, text-to-image synthesis [32,75]. Language also makes it possible to recognize concepts beyond the limited categories seen during training by projecting to language spaces, also known as zero-shot recognition [15,71].…”
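The zero-shot recognition idea mentioned above can be sketched concretely: an image embedding is compared, by cosine similarity, against language embeddings of candidate class names, and the nearest class wins. This is a minimal illustration with hand-crafted toy vectors, not the method of any cited work; the function name and the embeddings are assumptions for the sketch.

```python
import numpy as np

def zero_shot_classify(image_emb, class_name_embs, class_names):
    """Return the class whose language embedding has the highest
    cosine similarity to the image embedding (toy sketch)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_name_embs / np.linalg.norm(class_name_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity against each class embedding
    return class_names[int(np.argmax(sims))]

# Toy example: two orthogonal "text" embeddings standing in for the
# language encodings of the class names (hypothetical values).
class_names = ["cat", "dog"]
class_embs = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
image_emb = np.array([0.9, 0.1])  # closer to the "cat" direction
pred = zero_shot_classify(image_emb, class_embs, class_names)
print(pred)  # → cat
```

Because classification reduces to nearest-neighbor search in the shared space, the candidate vocabulary can include class names never seen during training, as long as a language embedding can be computed for them.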