2022
DOI: 10.1609/aaai.v36i2.20097

TEACh: Task-Driven Embodied Agents That Chat

Abstract: Robots operating in human spaces must be able to engage in natural language interaction, both understanding and executing instructions, and using conversation to resolve ambiguity and correct mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment t…
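To make the Commander/Follower setup described in the abstract concrete, below is a minimal, hypothetical sketch of how one such dialogue-and-action session might be represented in code. The class and field names here are illustrative assumptions only; they do not reflect the actual schema of the released TEACh dataset.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types for a Commander/Follower session.
# These names are illustrative and are NOT the released TEACh format.

@dataclass
class Utterance:
    speaker: str          # "commander" or "follower"
    text: str             # natural-language dialogue turn

@dataclass
class Action:
    agent: str            # which agent acted (the Follower executes in the environment)
    action_type: str      # e.g. "Forward", "Pickup", "Place"
    object_id: str = ""   # target object, if any

@dataclass
class Session:
    task_name: str                                      # e.g. "Make Coffee"
    dialogue: List[Utterance] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

# Example: a short exchange followed by a Follower action.
session = Session(task_name="Make Coffee")
session.dialogue.append(Utterance("commander", "First, find a mug in the cabinet."))
session.dialogue.append(Utterance("follower", "Which cabinet should I check?"))
session.actions.append(Action("follower", "OpenObject", "Cabinet_3"))
```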

Cited by 39 publications (25 citation statements)
References 31 publications
“…Embodied AI. The development of learning-based embodied AI agents has made significant progress across a wide variety of tasks, including: scene rearrangement [3,17,38], object-goal navigation [1,6,8,19,41,43], point-goal navigation [1,19,30,31,40], scene exploration [7,10], embodied question answering [12,18], instructional navigation [2,35], object manipulation [14,44], home task completion with explicit instructions [27,35,36], active visual learning [9,15,20,39], and collaborative task completion with agent-human conversations [29]. While these works have driven much progress in embodied AI, ours is the first agent to tackle the task of tidying up rooms, which requires commonsense reasoning about whether or not an object is out of place, and inferring where it belongs in the context of the room.…”
Section: Related Work (mentioning)
confidence: 99%
“…Vision-and-Language Navigation. Training embodied navigation agents has been an increasingly active research area (Anderson et al, 2018a,b; Chen et al, 2019; Ku et al, 2020; Shridhar et al, 2020; Padmakumar et al, 2022). Fried et al (2018b) propose to augment the training data with the speaker-follower models, which is improved by Tan et al (2019), who add noise to the environments so that the speaker can generate more diverse instructions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Researchers in the Interaction Lab have shown that previous work on so-called 'Visual Dialog' does not really require taking dialogue context into account, and proposed new visual dialogue datasets where linguistic context matters [3]. We are currently working to further develop interactive systems for learning grounded language, for example within the 2022 Amazon Alexa SimBot challenge [47,63]. Fig.…”
Section: Vision and Language (mentioning)
confidence: 99%
“…More recently, researchers in the Interaction Lab have developed deep learning systems such as 'Embodied BERT' (EmBERT) [62] which combine video streams and language to learn grounded language and action execution. Related to this work, we are currently the only European team participating in the Amazon Alexa SimBot challenge (2022), which works on the TEACh dataset [47] of videos combined with conversations about household tasks (see Fig. 3).…
Section: Embodied Interaction (mentioning)
confidence: 99%