This focus makes our work closely related to recent image-text models such as CLIP [64] and ALIGN [37], both of which have shown zero-shot transfer ability for image classification. Like CLIP and ALIGN, our work learns a mapping between images and text, which relates it to many previous works, such as [2,3,8,13,22,24,32,33,36,42,51,52,54,56,57,60,71,72,73,91].