“…To model the relationships between language and vision, existing methods [2, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-50] usually combine the language features and the regular image region features, such as object proposal regions [2, 22, 25, 27, 28, 41-43, 47-50] and grid regions [16, 35, 45], as shown in Figure 1 (a) and (b), respectively. However, these methods ignore some fine-grained object information related to the natural language, such as object shapes and poses, which are often described in language expressions and important in referring expression comprehension to localize and distinguish the target objects.…”