“…To enable this, the core question is how to define new classes, and using natural language to do so is a promising direction. With models pre-trained on large-scale image-text datasets [5], language-driven (text-based) visual recognition has been extended from image classification [45,27] to object detection [7,61,21,34] and semantic segmentation [19,60,32,59]. The task is also related to other cross-modal recognition topics, including image captioning [55,25], visual question answering [1,56,12,20], referring expression localization [26,41,67,39], and visual reasoning [62].…”