Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1155
Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration

Abstract: To enable language-based communication and collaboration with cognitive robots, this paper presents an approach where an agent can learn task models jointly from language instruction and visual demonstration using an And-Or Graph (AoG) representation. The learned AoG captures a hierarchical task structure where linguistic labels (for language communication) are grounded to corresponding state changes from the physical environment (for perception and action). Our empirical results on a cloth-folding domain have…
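As a rough illustration of the AoG idea described in the abstract, the following Python sketch shows one plausible way to encode a hierarchical task as AND nodes (ordered sub-steps), OR nodes (alternative decompositions), and terminal actions grounded to state changes. All names (AoGNode, leaf, flatten), the toy cloth-folding example, and the state-change strings are assumptions for illustration only, not the authors' implementation.

```python
# Illustrative sketch of an And-Or Graph (AoG) for a hierarchical task.
# Hypothetical structure; the paper additionally learns these graphs jointly
# from language instruction and visual demonstration.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AoGNode:
    """A node in a task And-Or Graph.

    An AND node decomposes a task into an ordered sequence of sub-steps;
    an OR node offers alternative ways of achieving the same sub-goal.
    """
    label: str                           # linguistic label, e.g. "fold the sleeve"
    node_type: str                       # "and", "or", or "terminal"
    children: List["AoGNode"] = field(default_factory=list)
    state_change: Optional[str] = None   # grounded effect observed in the demonstration


def leaf(label: str, state_change: str) -> AoGNode:
    """Terminal action grounded to a perceived state change."""
    return AoGNode(label, "terminal", state_change=state_change)


# Toy cloth-folding task: fold both sleeves (in either order), then fold in half.
fold_sleeves = AoGNode(
    "fold the sleeves", "or",
    children=[
        AoGNode("left sleeve first", "and", children=[
            leaf("fold the left sleeve", "left_sleeve: out -> folded"),
            leaf("fold the right sleeve", "right_sleeve: out -> folded"),
        ]),
        AoGNode("right sleeve first", "and", children=[
            leaf("fold the right sleeve", "right_sleeve: out -> folded"),
            leaf("fold the left sleeve", "left_sleeve: out -> folded"),
        ]),
    ],
)

fold_shirt = AoGNode(
    "fold the shirt", "and",
    children=[
        fold_sleeves,
        leaf("fold it in half", "shirt: unfolded -> folded"),
    ],
)


def flatten(node: AoGNode) -> List[str]:
    """Expand one valid execution: all children of AND nodes, first option of OR nodes."""
    if node.node_type == "terminal":
        return [node.label]
    picked = node.children if node.node_type == "and" else node.children[:1]
    return [step for child in picked for step in flatten(child)]


if __name__ == "__main__":
    print(flatten(fold_shirt))
    # ['fold the left sleeve', 'fold the right sleeve', 'fold it in half']
```

The OR node is what lets a single learned graph cover variations across demonstrations (e.g., which sleeve is folded first), while the terminal nodes keep each linguistic label tied to the state change that grounds it.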

Cited by 31 publications (17 citation statements)
References 31 publications (21 reference statements)
“…Originating in the robotics community, learning from demonstration (LfD) (Thomaz and Cakmak, 2009; Argall et al., 2009) enables robots to learn a mapping from world states to robot manipulations based on a human's demonstration of desired robot behaviors. More recent work has also explored the use of natural language and dialogue together with demonstration to teach robots new actions (Mohan and Laird, 2014; Scheutz et al., 2017; Liu et al., 2016; She and Chai, 2017; Chai et al., 2018; Gluck and Laird, 2018).…”
Section: Related Work (mentioning)
Confidence: 99%
“…Procedural text understanding and knowledge extraction (Chu et al., 2017; Park and Motahari Nezhad, 2018; Kiddon et al., 2015; Jermsurawong and Habash, 2015; Liu et al., 2016; Long et al., 2016; Maeta et al., 2015; Malmaud et al., 2014; Artzi and Zettlemoyer, 2013; Kuehne et al., 2017) (Kiddon et al., 2015; Jermsurawong and Habash, 2015), our approach differs as we extract knowledge from the visual signals and transcripts directly, not from imperative recipe texts. Instructional video understanding.…”
Section: Related Work (mentioning)
Confidence: 99%
“…While several methods have been developed for learning the grounding of instructions into logical forms for a robot to carry out a plan [2,3], these do not allow the flexibility required for the type of interaction in (1) and rely on explicit verb forms which are directly grounded in a corresponding action. Even if statistical NLU methods allow for some flexibility in the form, these still only permit a command-and-control Human-Robot Interaction (HRI) with long waiting times and no ability to adjust plans on the fly.…”
Section: A (mentioning)
Confidence: 99%
“…Firstly, we follow [3] in showing how a hierarchical structure can capture simple robotic tasks in a useful way for NLU. Fig.…”
Section: HRI Intentions as Adjustable Hierarchical Action Graphs (mentioning)
Confidence: 99%