Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring

Zhang, Yichi; Chai, Joyce

doi:10.18653/v1/2021.findings-acl.368

Cited by 26 publications

(11 citation statements)

References 34 publications


“…In “MoViLan + PerfectMap”, ground truth BEV maps are provided to the agent as an ablation study for removal of our mapping module. Our framework demonstrates superior performance compared to the baseline (Shridhar et al, 2020) algorithms, Moca (Singh et al, 2020), HiTUT (Zhang and Chai, 2021), HLSM (Blukis et al, 2022), and LWIT (Nguyen et al, 2021) on complete tasks. For sub-goal tasks, our framework has significantly higher path weighted success rates for “GoTo” compared to previous works (language instructions requiring pure navigation) because of novel mapping module, and hence higher overall success rates due to better positioning.…”

Section: Results (mentioning)

confidence: 92%

“…The VPM module in this study executes the interaction mask of the target object, and the APM module predicts the action sequence. HiTUT method (Zhang and Chai, 2021) tries to increase the success rate of the ALFRED dataset by decomposing task learning into three sub tasks; sub-goal planning, scene navigation and object manipulation. All three sub tasks share the similar input form; therefore they solve together by applying an unified model upon on multi-task learning.…”

Section: Introduction (mentioning)

confidence: 99%


A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment

et al. 2022


In this paper we propose a new framework, MoViLan (Modular Vision and Language), for execution of visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets revealed the gap in developing comprehensive techniques for long horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions and visual scenarios with non-reversible state changes. We propose a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data (e.g., in the form of expert demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language data sets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments, and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates for long horizon, compositional tasks over recent works on the recently released benchmark data set ALFRED.


“…• Learning from Explanations with Neural Execution Tree (Wang et al, 2020) • Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach (Yin et al, 2019) • Textual Entailment for Event Argument Extraction: Zero-and Few-Shot with Multi-Source Learning (Sainz et al, 2022) • Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (Liu et al, 2021) • True Few-Shot Learning With Prompts-A Real-World Perspective (Schick and Schütze, 2022) • The Turking Test: Can Language Models Understand Instructions? (Efrat and Levy, 2020) • Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring (Zhang and Chai, 2021) • Cross-Task Generalization via Natural Language Crowdsourcing Instructions (Mishra et al, 2022) • MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction Following (Lou et al, 2023)…”

Section: A Appendix (mentioning)

confidence: 99%

LLM-driven Instruction Following: Progresses and Concerns

Yin, Ye, Liu et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts


The progress of natural language processing (NLP) is primarily driven by machine learning that optimizes a system on a large-scale set of task-specific labeled examples. This learning paradigm limits the ability of machines to have the same capabilities as humans in handling new tasks since humans can often solve unseen tasks with a couple of examples accompanied by task instructions. In addition, we may not have a chance to prepare a large volume of task-specific examples for new tasks because we cannot foresee what task needs to be addressed next and how complex it will be to annotate. Therefore, task instructions act as a novel and promising resource for supervision. This tutorial targets researchers and practitioners who are interested in AI and ML technologies for NLP generalization in a low-shot scenario. In particular, we will present a diverse thread of instruction-driven NLP studies that try to answer the following questions: (i) What is task instruction? (ii) How is the process of creating datasets and evaluating systems conducted? (iii) How to encode task instructions? (iv) When and why do some instructions work better? (v) What concerns remain in LLM-driven instruction following? We will discuss several lines of frontier research that tackle those challenges and will conclude the tutorial by outlining directions for further investigation.


“…Several approaches have been proposed to solve this. Predominant methods exploit the embodied nature of a robotic agent to infer and refine the plan by primarily using multi-modal input that includes visual feedback and action priors (Paxton et al, 2019; Shridhar et al, 2020; Zhang and Chai, 2021; Ahn et al, 2022). Thus natural language understanding in these systems is simplified by obtaining a latent representation of the language input to bias the inference using attention modeling.…”

Section: Related Work (mentioning)

confidence: 99%

tagE: Enabling an Embodied Agent to Understand Human Instructions

Sarkar, Mitra, Pramanick et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Natural language serves as the primary mode of communication when an intelligent agent with a physical presence engages with human beings. While a plethora of research focuses on natural language understanding (NLU), encompassing endeavors such as sentiment analysis, intent prediction, question answering, and summarization, the scope of NLU directed at situations necessitating tangible actions by an embodied agent remains limited. The ambiguity and incompleteness inherent in natural language present challenges for intelligent agents striving to decipher human intention. To tackle this predicament head-on, we introduce a novel system known as task and argument grounding for Embodied agents (tagE). At its core, our system employs an inventive neural network model designed to extract a series of tasks from complex task instructions expressed in natural language. Our proposed model adopts an encoder-decoder framework enriched with nested decoding to effectively extract tasks and their corresponding arguments from these intricate instructions. These extracted tasks are then mapped (or grounded) to the robot's established collection of skills, while the arguments find grounding in objects present within the environment. To facilitate the training and evaluation of our system, we have curated a dataset featuring complex instructions. The results of our experiments underscore the prowess of our approach, as it outperforms robust baseline models.

