Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy

Testoni, Alberto; Bernardi, Raffaella

doi:10.18653/v1/2021.emnlp-main.736

Cited by 7 publications

(12 citation statements)

References 21 publications

“…Existing interactive robots/agents using multimodal features have focused on question-answering from images [14], [15], request analysis [2], and conversations about images [16], [17], [18], [19]. Whether incorporating situation understanding results from multimodal cues significantly improves related tasks has been investigated to determine if we can clearly define things to be recognized for tasks.…”

Section: B Using Multimodal Cues For Action Decisions (mentioning)

confidence: 99%

Do as I Demand, Not as I Say: A Dataset for Developing a Reflective Life-Support Robot

Tanaka, Yamasaki, Yuguchi et al. 2024

IEEE Access

Interactive robots that cooperate with humans must take appropriate actions in response to their requests. Unfortunately, such requests often have information gaps with their actual demands. However, robots are still expected to reason and act on what is required, depending on the situation. We call these reflective actions. To achieve such reflective actions for robots, we constructed a dataset that consists of the reflective actions of a domestic manipulation robot, in which the actions correspond to user utterances with their surroundings situations. By crowdsourcing, we defined several action scenarios that could be regarded as reflective. We recorded videos of situations described in the crowdsourcing scenarios, corresponding to the user situations just before the robot's reflective actions. We also annotated the videos of the user utterance transcriptions, objects, user poses, and user positions to investigate the contribution of such descriptive features to the reflective action decisions. Our experimental results indicated that even though our newly defined task is very challenging, it can be solved if the system has a concrete understanding of the situation.

“…Simulating Dual-coding theory of human cognition to adaptively find query-related information from the image. Testoni et al [161] Asking questions to confirm the conjecture of models about the referent guided by human cognitive literature.…”

Section: Unique Training Schemes-based VAD (mentioning)

confidence: 99%

“…Motivated by Dual-coding theory [124] of human cognition, Dual Encoding Visual Dialogue (DualVD) model [65] adaptively finds query-related information from the image through intra-modal visual features and inter-modal visual-semantic knowledge semantics. Based on a beam search re-ranking algorithm, Testoni et al propose Confirm-it [161], which asks questions to confirm the conjecture of models about the referent with human cognitive literature on information search and cross-situational word learning. To explore the ability of AI dialogue agents to both ask questions and answer them as humans, researchers have made preliminary explorations.…”

Section: Unique Training (mentioning)

confidence: 99%

“…Moreover, Gatt et al [33] point out that human to be overspecific and prefer properties irrespectively when referring to objects. Based on this, Testoni et al [161] propose the Confirm-it model to generate questions driven by the agent's confirmation bias for human-like dialogue generation. A worthwhile future research point is to explore the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context to provide sufficient priori knowledge for data-driven deep models, thus understanding visual context in a more efficient and informative way and generating more anthropomorphic responses.…”

Section: The Cognitive Mechanisms Of Human-machine Dialogue Under Cro... (mentioning)

confidence: 99%

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

Wang, Guo, Zeng et al. 2022

Preprint

The intelligent dialogue system, aiming at communicating with humans harmoniously with natural language, is brilliant for promoting the advancement of human-machine interaction in the era of artificial intelligence. With the gradually complex human-computer interaction requirements (e.g., multimodal inputs, time sensitivity), it is difficult for traditional text-based dialogue system to meet the demands for more vivid and convenient interaction. Consequently, Visual-Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or videos, textual dialogue history), has become a predominant research paradigm. Benefiting from the consistency and complementarity between visual and textual context, VAD possesses the potential to generate engaging and context-aware responses. For depicting the development of VAD, we first characterize the concepts and unique features of VAD, and then present its generic system architecture to illustrate the system workflow. Subsequently, several research challenges and representative works are detailed investigated, followed by the summary of authoritative benchmarks. We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context, and knowledge-enhanced cross-modal semantic interaction.

CCS Concepts: • Human-centered computing → HCI theory, concepts and models; • Computing methodologies → Discourse, dialogue and pragmatics.

“…Starting from a probability distribution over all candidate tokens in the vocabulary, this technique samples the next token from the set of candidates defined as the top-p subset of the cumulative probability mass. Recently, Testoni and Bernardi (2021b) propose a beam-search re-ranking strategy to promote the generation of more effective questions throughout the dialogue. In this paper, we focus on the effect of different training sets using the same decoding strategy.…”

Section: Figure (mentioning)

confidence: 99%

Garbage In, Flowers Out: Noisy Training Data Help Generative Models at Test Time

Testoni, Bernardi 2022

ijcol

Self Cite

Despite important progress, conversational systems often generate dialogues that sound unnatural to humans. We conjecture that the reason lies in the different training and testing conditions: agents are trained in a controlled "lab" setting but tested in the "wild". During training, they learn to utter a sentence given the ground-truth dialogue history generated by human annotators. On the other hand, during testing, the agents must interact with each other, and hence deal with noisy data. We propose to fill this gap between the training and testing environments by training the model with mixed batches containing both samples of human and machine-generated dialogues. We assess the validity of the proposed method on GuessWhat?!, a visual referential game. We show that our method improves the linguistic quality of the generated dialogues, and it leads to higher accuracy of the guessing task; simple perturbations of the ground-truth dialogue history that mimic machine-generated data do not account for a similar improvement. Finally, we run a human evaluation experiment on a sample of machine-machine dialogues to complement the quantitative analysis. This experiment shows that also human annotators successfully exploit dialogues generated by a model trained with mixed batches to solve the task. Hence, the mixed-batch training does not cause a language drift. Moreover, we find that the new training regime allows human annotators to be significantly more confident when selecting the target object, showing that the generated dialogues are informative.
