Visual Reasoning with Multi-hop Feature Modulation

Strub, Florian; Seurin, Mathieu; Perez, Ethan; Vries, Harm de; Mary, Jérémie; Preux, Philippe; Courville, Aaron; Pietquin, Olivier

doi:10.1007/978-3-030-01228-1_48

Cited by 22 publications

(15 citation statements)

References 36 publications

(65 reference statements)


“…Referring expression grounding, also known as referring expression comprehension, is often formulated as an object retrieval task [11,26]. [39,23,41] explored context information in images, and [31] proposed multi-step reasoning by multi-hop Feature-wise Linear Modulation. Hu et al [10] proposed compositional modular networks, composed of a localization module and a relationship module, to identify subjects, objects and their relationships.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Liu, Wang, Shao, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

157


Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.


“…To the best of our knowledge, all existing work use the same baseline Oracle [8] except [32]. We compare the performance of the baseline oracles with the proposed VilBERT-Oracle.…”

Section: The Oracle Model (mentioning)

confidence: 99%

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

Ping, Thattai, et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


GuessWhat?! is a visual dialog guessing game which incorporates a Questioner agent that generates a sequence of questions, while an Oracle agent answers the respective questions about a target object in an image. Based on this dialog history between the Questioner and the Oracle, a Guesser agent makes a final guess of the target object. While previous work has focused on dialogue policy optimization and visual-linguistic information fusion, most work learns the vision-linguistic encoding for the three agents solely on the GuessWhat?! dataset without shared and prior knowledge of vision-linguistic representation. To bridge these gaps, this paper proposes new Oracle, Guesser and Questioner models that take advantage of a pretrained vision-linguistic model, VilBERT. For Oracle model, we introduce a two-way background/target fusion mechanism to understand both intra and inter-object questions. For Guesser model, we introduce a state-estimator that best utilizes VilBERT's strength in single-turn referring expression comprehension. For the Questioner, we share the state-estimator from pretrained Guesser with Questioner to guide the question generator. Experimental results show that our proposed models outperform state-of-the-art models significantly by 7%, 10%, 12% for Oracle, Guesser and End-to-End Questioner respectively.


“…Guesser model is evaluated by classification error rate. The 2 baseline models [6]: HRED, HRED-VGG, 3 attention-based models PLAN [28], A-ATT [7], HACAN [25], and 2 Feature-wise Linear Modulation (FiLM) models: single-hop FiLM [14], multi-hop FiLM [23], are compared. Table 3 compares the test error of Guess models.…”

Section: Evaluation Metric and Comparison Models (mentioning)

confidence: 99%

Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

Feng, Wang, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


A goal-oriented visual dialogue involves multi-turn interactions between two agents, Questioner and Oracle. During which, the answer given by Oracle is of great significance, as it provides golden response to what Questioner concerns. Based on the answer, Questioner updates its belief on target visual content and further raises another question. Notably, different answers drive into different visual beliefs and future questions. However, existing methods always indiscriminately encode answers after much longer questions, resulting in a weak utilization of answers. In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states. First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention by sharpening question-related attention and adjusting it by answer-based logical operation at each turn. Then based on the focusing attention, we get the visual state estimation by Conditional Visual Information Fusion (CVIF), where overall information and difference information are fused conditioning on the question-answer state. We evaluate the proposed ADVSE to both question generator and guesser tasks on the large-scale GuessWhat?! dataset and achieve the state-of-the-art performances on both tasks. The qualitative results indicate that the ADVSE boosts the agent to generate highly efficient questions and obtains reliable visual attentions during the reasonable question generation and guess processes.
