Dynamic Graph Attention for Referring Expression Comprehension

Yang, Sibei; Li, Guanbin; Yu, Yizhou

doi:10.1109/iccv.2019.00474

Cited by 183 publications

(86 citation statements)

References 26 publications

“…Liu et al [15] develop a neural module tree network to regularize the visual grounding along the dependency parsing tree of the sentence. The works in [3,36,42] argue to learn the representations from expression and image regions in a stepwise manner, and perform multi-step reasoning for better matching performance. Wang et al [30] propose a graph-based language-guided attention network to highlight the inter-object and intra-object relationships that are closely relevant to the expression for better performance.…”

Section: Related Work 2.1 Referring Expression Comprehension (mentioning)

confidence: 99%

Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge

Wang

Liu

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Conventional referring expression comprehension (REF) assumes people to query something from an image by describing its visual appearance and spatial location, but in practice, we often ask for an object by describing its affordance or other non-visual attributes, especially when we do not have a precise target. For example, sometimes we say 'Give me something to eat'. In this case, we need to use commonsense knowledge to identify the objects in the image. Unfortunately, there is no existing referring expression dataset reflecting this requirement, not to mention a model to tackle this challenge. In this paper, we collect a new referring expression dataset, called KB-Ref, containing 43k expressions on 16k images. In KB-Ref, to answer each expression (detect the target object referred by the expression), at least one piece of commonsense knowledge must be required. We then test state-of-the-art (SoTA) REF models on KB-Ref, finding that all of them present a large drop compared to their outstanding performance on general REF datasets. We also present an expression conditioned image and fact attention (ECIFA) network that extracts information from correlated image regions and commonsense knowledge facts. Our method leads to a significant improvement over SoTA REF models, although there is still a gap between this strong baseline and human performance. The dataset and baseline models are available at: https://github.com/wangpengnorman/KB-Ref_dataset. CCS CONCEPTS • Information systems → Image search; • Computing methodologies → Reasoning about belief and knowledge; Matching.


“…The task of referring expression comprehension has attracted increasing attention in recent years, which expects to locate corresponding objects within an image based on input expressions. Previous referring expression comprehension methods [2, 6, 11, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-51] can be mainly divided into two types, including proposal-region-based [2, 6, 11, 22, 25, 27, 28, 41-43, 47-51] and grid-region-based methods [16, 35, 45].…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of proposal-region-based methods [2, 22, 41-43, 47, 50] are based on the "listener" strategy, which first combines language features with visual features of proposal regions, and then select the target region that best matches the input expression from these proposals. The proposal regions are typically extracted by a pretrained object detector (e.g., Faster R-CNN [34], Mask R-CNN [9] and others [4, 17, 30-32]).…”

Section: Related Work (mentioning)

confidence: 99%
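
The "listener" strategy described in the statement above (fuse language features with proposal features, then pick the best-matching proposal) can be illustrated with a minimal sketch. This is not the method of any cited paper: the function names `cosine` and `select_region`, the use of plain cosine similarity as the matching score, and the toy feature vectors are all assumptions made for illustration. Real systems would take region features from a pretrained detector and learn the matching function.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_region(
    expr_feat: Sequence[float],
    region_feats: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Score every proposal against the expression and return (best index, all scores).

    Hypothetical stand-in for a learned matching module: the score here is
    just cosine similarity in a shared embedding space.
    """
    scores = [cosine(expr_feat, region) for region in region_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: three proposals, the second points most nearly in the
# same direction as the expression embedding.
expression = [1.0, 0.0]
proposals = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best_index, all_scores = select_region(expression, proposals)
```

In this toy setup the selection step is a plain argmax over per-proposal scores, which is the part of the "listener" pipeline the quoted statement describes; everything upstream of the feature vectors is left out.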

“…To align visual regions with the expression more powerful, methods in [22,47] proposed to decompose the expression into three components (subject, localization and relationship), and leveraged cross-modality attentions to focus on relevant regions. To adaptively grounding complex referring expressions, a dynamic graph attention network in [43] performed multi-step visual reasoning based on the multi-modal context relationship graph to identify the compound objects step by step. In several works [25,27,28,48], a "speaker" strategy first predicted expressions from the visual features of every object and then matched the predicted expressions with the input expression to find out the desired object.…”

Section: Related Work (mentioning)

confidence: 99%
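
The idea of multi-step reasoning over a relationship graph, as summarized in the statement above, can be illustrated with a toy sketch. It is deliberately simplified and is not the dynamic graph attention network from the paper: the function `multi_step_attention`, the edge-weight dictionary, and the update rule (initial match score plus weighted attention of related objects, renormalized with a softmax) are assumptions chosen only to show how attention on one object can be sharpened step by step through its relations to others.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_step_attention(
    node_scores: Sequence[float],
    edges: Dict[Tuple[int, int], float],
    steps: int,
) -> List[float]:
    """Refine attention over objects for a fixed number of reasoning steps.

    node_scores: initial expression-to-object match score per object.
    edges: (i, j) -> weight, meaning attention on object j supports object i
           (hypothetical encoding of a relation mentioned in the expression).
    """
    attention = softmax(node_scores)
    for _ in range(steps):
        updated = list(node_scores)
        for (target, source), weight in edges.items():
            updated[target] += weight * attention[source]
        attention = softmax(updated)
    return attention


# Toy scene for "the man holding the umbrella": objects 0 and 1 are both men
# with equal initial scores, object 2 is the umbrella, and only man 0 has a
# "holding" relation to it.
initial = [1.0, 1.0, 2.0]
relations = {(0, 2): 2.0}
before = multi_step_attention(initial, relations, steps=0)
after = multi_step_attention(initial, relations, steps=2)
```

With zero steps the two men are indistinguishable; after propagation the man related to the umbrella receives more attention than the other, which is the kind of disambiguation stepwise reasoning is meant to provide.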

“…To model the relationships between language and vision, existing methods [2, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-50] usually combine the language features and the regular image region features, such as object proposal regions [2, 22, 25, 27, 28, 41-43, 47-50] and grid regions [16, 35, 45], as shown in Figure 1 (a) and (b), respectively. However, these methods ignore some fine-grained object information related to the natural language, such as object shapes and poses, which are often described in language expressions and important in referring expression comprehension to localize and distinguish the target objects.…”

Section: Introduction (mentioning)

confidence: 99%


Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension

Qiu

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Referring expression comprehension expects to accurately locate an object described by a language expression, which requires precise language-aware visual object representations. However, existing methods usually use rectangular object representations, such as object proposal regions and grid regions. They ignore some fine-grained object information like shapes and poses, which are often described in language expressions and important to localize objects. Additionally, rectangular object regions usually contain background contents and irrelevant foreground features, which also decrease the localization performance. To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. Rather than extracting rectangular object representations, LDC adaptively samples a set of key points based on the image and language to represent objects. This type of object representations can capture more fine-grained object information (e.g., shapes and poses) and suppress noises in accordance with language and thus, boosts the object localization performance. Based on the language-aware fine-grained object representation, we next design a bidirectional interaction model (BIM) that leverages a modified co-attention mechanism to build cross-modal bidirectional interactions to further improve the language and object representations. Furthermore, we propose a hierarchical fine-grained representation network (HFRN) to learn language-aware fine-grained object representations and cross-modal bidirectional interactions at local word level and global sentence level, respectively. Our proposed method outperforms the state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.


Propagating Over Phrase Relations for One-Stage Visual Grounding

Yang

2020

Lecture Notes in Computer Science

