Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

Suglia, Alessandro; Vergari, Antonio; Konstas, Ioannis; Bisk, Yonatan; Bastianelli, Emanuele; Vanzo, Andrea; Lemon, Oliver

doi:10.18653/v1/2020.coling-main.95

Cited by 8 publications

(11 citation statements)

References 28 publications


“…An accurate representation of state needs to be maintained as new information and observations accumulate [68]. Again, in recent deep learning approaches, no explicit state representation is developed, and the state information is encoded using sequences of prior turns in the interaction [10,62,65].…”

Section: State Tracking (mentioning)

confidence: 99%


Conversational AI for multi-agent communication in Natural Language

Lemon, 2022, AIC. Self-citation.

Research at the Interaction Lab focuses on human-agent communication using conversational Natural Language. The ultimate goal is to create systems where humans and AI agents (including embodied robots) can spontaneously form teams and coordinate shared tasks through the use of Natural Language conversation as a universal communication interface. This paper first introduces machine learning approaches to problems in conversational AI in general, where computational agents must coordinate with humans to solve tasks using conversational Natural Language. It also covers some of the practical systems developed in the Interaction Lab, ranging from speech interfaces on smart speakers to embodied robots interacting using visually grounded language. In several cases communication between multiple agents is addressed. The paper surveys the central research problems addressed here, the approaches developed, and our main results. Some key open research questions and directions are then discussed, leading towards a future vision of conversational, collaborative multi-agent systems.


“…Another example is using context to re-rank possible system responses, either at the level of DM or NLG decision-making [56]. A particular recent focus is on the use and adaptation of large pre-trained vision-and-language models in interactive systems [63,65].…”

Section: Machine Learning Models Of Language Processing (mentioning)

confidence: 99%

Conversational AI for multi-agent communication in Natural Language

Lemon, 2022, AIC. Self-citation.

“…By operating only in simulation, our model also misses the full range of experience that can ground language in the world [11], such as haptic feedback during object manipulation [78,79,68], and audio [16] and speech [31,41] features of the environment. Further, in ALFRED an agent never encounters novel object classes at inference time, which represent an additional challenge for successful task completion [72].…”

Section: Limitations and Impact (mentioning)

confidence: 99%

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

Suglia, Gao, Thomason et al., 2021. Preprint. Self-citation.

Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion (https://github.com/amazon-research/embert). Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize object-centric navigation targets.


“…Humans highly rely on their prediction skills when interpreting a new input, integrating their perceptual signal with prior knowledge. We hope that more awareness of cognitive and neuroscience findings towards the combination of bottom‐up (perceptual) and top‐down (prior) knowledge will help shaping new multimodal models (Schüz & Zarrieß, 2020; Suglia et al., 2020; Testoni, Pezzelle et al., 2019).…”

Section: Open Challenges and Future Directions (mentioning)

confidence: 99%

Linguistic issues behind visual question answering

Bernardi, Pezzelle, 2021, Language and Linguist. Compass

Answering a question that is grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: among others, semantics, syntax and pragmatics. As such, visually‐grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they have inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging task—still unsolved—of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually‐grounded questions has experienced a renewed interest in recent years, to the point of becoming a research sub‐field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of the theoretical linguists; we extract from pioneering computational linguistic work a list of desiderata that we use to review current computational achievements. We acknowledge that impressive progress has been made to reconcile the engineering with the theoretical view. At the same time, we claim that further research is needed to get to a unified approach which jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.
