2021
DOI: 10.48550/arXiv.2110.00534
Preprint
TEACh: Task-driven Embodied Agents that Chat

Abstract: Robots operating in human spaces must be able to engage in natural language interaction, both understanding and executing instructions, and using conversation to resolve ambiguity and correct mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment t…

Cited by 6 publications (10 citation statements)
References 20 publications (26 reference statements)
“…Other work has been done on grounded language agents, but their capabilities are restricted to a limited task space (Bara et al., 2021; Thomason et al., 2019; Padmakumar et al., 2022). However, recent work in simulation and robotics using human demonstration data has pushed the capabilities of grounded language agents toward open-ended, language-conditioned interaction (Lynch et al., 2020; Abramson et al., 2020; DeepMind Interactive Agents Team et al., 2021; Padmakumar et al., 2021). To our knowledge, this work presents the first careful study of a metric that evaluates open-ended interaction in a standardised way.…”
Section: Related Work
Mentioning confidence: 98%
“…ALFRED [2], a recently proposed benchmark along this direction, requires the agent to complete complex household tasks by following natural language instructions. Dialogue-enabled agents in navigation or manipulation tasks have recently been proposed [14], [15]; these focus on action prediction from dialogue history, and do not emphasize the agent's ability to ask task-appropriate questions. In this paper, we take a further step in dialogue-enabled agents by presenting a benchmark for the agent to actively ask questions and learn from the answers to better finish the task.…”
Section: Related Work
Mentioning confidence: 99%
“…Embodied Demonstrations from Humans. Prior expert demonstration datasets for embodied tasks combining vision and action (and optionally language) can be broadly categorized into either consisting of shortest-path trajectories from a planner with privileged information [5, 7, 8, 29], or consisting of human-provided trajectories [23–25]. While some works in the former collect natural language data from humans [5, 7], we contend that collecting navigation data from humans is equally crucial.…”
Section: Related Work
Mentioning confidence: 99%
“…Datasets with human-provided navigation trajectories are typically small. TEACh [23], CVDN [24] and WAY [25] have <10k episodes, while the EmbodiedQA [8] dataset has ∼700 human-provided episodes; all prohibitively small for training proficient agents. A key contribution of our work is a scalable web-based infrastructure for collecting human navigation and interaction demonstrations that is easily extensible to any task situated in the Habitat [19] simulator, including language-based tasks.…”
Section: Related Work
Mentioning confidence: 99%