Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-1265
Beyond task success: A closer look at jointly learning to see, ask, and answer

Abstract: We propose a grounded dialogue state encoder which addresses a foundational issue on how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multitask learning. We further enrich our model via a cooperative learning…
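The abstract describes a single grounded dialogue state shared by the question generator (QGen) and the Guesser, trained jointly with multitask learning. The sketch below is a minimal PyTorch-style illustration of that idea, not the authors' implementation: the module names, dimensions, and the concatenate-then-project fusion of visual and language features are assumptions made for clarity.

```python
# Minimal sketch (not the authors' implementation) of a visually grounded
# dialogue state encoder shared by a question generator (QGen) and a Guesser.
# All names, dimensions, and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class GroundedDialogueState(nn.Module):
    def __init__(self, vocab_size, num_objects, emb_dim=256,
                 hid_dim=512, vis_dim=2048, state_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.dialogue_rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Fuse the dialogue summary with precomputed image features.
        self.fuse = nn.Sequential(
            nn.Linear(hid_dim + vis_dim, state_dim), nn.Tanh())
        # Two heads share the grounded state: one scores candidate objects
        # (Guesser), one predicts the next question token (QGen).
        self.guesser_head = nn.Linear(state_dim, num_objects)
        self.qgen_head = nn.Linear(state_dim, vocab_size)

    def forward(self, dialogue_tokens, image_feats):
        # dialogue_tokens: (batch, seq_len) token ids of the dialogue so far
        # image_feats:     (batch, vis_dim) global visual features
        emb = self.embed(dialogue_tokens)
        _, (h, _) = self.dialogue_rnn(emb)
        state = self.fuse(torch.cat([h[-1], image_feats], dim=-1))
        return self.guesser_head(state), self.qgen_head(state)


if __name__ == "__main__":
    model = GroundedDialogueState(vocab_size=1000, num_objects=20)
    tokens = torch.randint(1, 1000, (4, 12))
    feats = torch.randn(4, 2048)
    guess_logits, next_token_logits = model(tokens, feats)
    print(guess_logits.shape, next_token_logits.shape)  # (4, 20) (4, 1000)
```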

Cited by 43 publications (81 citation statements)
References 27 publications
“…The Reinforcement Learning (RL) model casts the problem as a reinforcement learning task and trains the previous model with policy gradient. The Visually-Grounded State Encoder (GDSE) models, both Supervised Learning (SL) and Cooperative Learning (CL) (Shekhar et al., 2019), use a visually grounded dialogue state that takes the visual features and each new question to create a shared representation used for both QGen and Guesser. They differ in that SL is trained in a supervised fashion while CL samples new objects from pictures and makes the agents train in a cooperative learning fashion on those artificially generated games.…”
Section: Models and Experiments (mentioning)
confidence: 99%
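The statement above contrasts three training regimes built on the shared encoder. Below is a minimal sketch of the supervised multitask (SL) update, reusing the hypothetical GroundedDialogueState module sketched under the abstract; the loss weights are assumptions, and the CL self-play loop and the RL policy-gradient objective are only summarized in comments rather than implemented.

```python
# Hedged sketch of one supervised multitask (SL) update: a single gradient
# step combines the QGen next-token loss and the Guesser loss, so both tasks
# shape the shared grounded state. The CL regime would run the same update on
# games the agents generate themselves for newly sampled target objects, and
# the RL model would instead optimise task success with policy gradient;
# neither loop is reproduced here.
import torch
import torch.nn.functional as F


def multitask_step(model, optimizer, tokens, image_feats,
                   next_token, target_object,
                   qgen_weight=1.0, guesser_weight=1.0):
    """One SL update on a batch of human GuessWhat?! games."""
    guess_logits, qgen_logits = model(tokens, image_feats)
    loss = (qgen_weight * F.cross_entropy(qgen_logits, next_token)
            + guesser_weight * F.cross_entropy(guess_logits, target_object))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```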
“…Following Shekhar et al. (2019), we classify questions into different types and evaluate the Oracle accuracy for each type. We distinguish between eight types of questions.…”
Section: Analysis Of Oracle Accuracy (mentioning)
confidence: 99%
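The analysis above groups Oracle questions by type before computing per-type accuracy. The sketch below shows one simple way such a classifier could look; since the statement does not list the eight types from Shekhar et al. (2019), the categories, keywords, and helper function here are illustrative assumptions rather than the published scheme.

```python
# Minimal keyword-based sketch of the kind of question-type classification the
# statement refers to. The categories and trigger words below are purely
# illustrative assumptions, not the eight types used by Shekhar et al. (2019).
import re

QUESTION_TYPES = {
    "color":    {"color", "red", "blue", "green", "black", "white", "yellow"},
    "location": {"left", "right", "front", "behind", "near", "top", "bottom"},
    "shape":    {"round", "square", "rectangular", "shape"},
    "size":     {"big", "small", "large", "tiny", "size"},
    "action":   {"wearing", "holding", "sitting", "standing", "moving"},
    "object":   {"person", "animal", "vehicle", "food", "furniture"},
}


def classify_question(question: str) -> str:
    """Return the first matching type, or 'other' if no keyword fires."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    for qtype, keywords in QUESTION_TYPES.items():
        if tokens & keywords:
            return qtype
    return "other"


def per_type_accuracy(questions, oracle_preds, gold_answers):
    """Aggregate Oracle accuracy per question type (hypothetical helper)."""
    correct, total = {}, {}
    for q, pred, gold in zip(questions, oracle_preds, gold_answers):
        t = classify_question(q)
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + int(pred == gold)
    return {t: correct[t] / total[t] for t in total}
```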
“…Instead, Kottur et al. (2019) proposed a diagnostic dataset to investigate a model's language understanding; however, their dialogues are generated artificially and may not reflect the true nature of visual dialogues. Shekhar et al. (2019) also acknowledge the importance of linguistic analysis but only deal with coarse-level features that can be computed automatically (such as dialogue topic and diversity). Most similar and related to our research are ; Udagawa and Aizawa (2020), where they conducted additional annotation of reference resolution in visual dialogues; however, they still do not capture more sophisticated linguistic structures such as PAS, modification and ellipsis.…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (Kazemzadeh et al., 2014; De Vries et al., 2017a; Kim et al., 2020). The work on visual dialogue games (Geman et al., 2015) brings new resources and models for generating referring expressions for referents in images (Suhr et al., 2019; Shekhar et al., 2018), visually grounded spoken language communication (Roy, 2002; Gkatzia et al., 2015), and captioning (Levinboim et al., 2019; Alikhani and Stone, 2019), which can be used creatively to demonstrate how a system understands a user. Figure 1 shows two examples of models that understand and generate referring expressions in multimodal settings.…”
Section: Reading List (mentioning)
confidence: 99%