2021
DOI: 10.48550/arxiv.2104.05845
Preprint

Visual Goal-Step Inference using wikiHow

Abstract: Procedural events can often be thought of as a high level goal composed of a sequence of steps. Inferring the sub-sequence of steps of a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. Our task is challenging…
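As an illustration of the task format described in the abstract, the following is a minimal sketch of the four-way multiple-choice setup: given a textual goal, score each of four candidate step images and pick the best match. It uses an off-the-shelf CLIP model (via the Hugging Face transformers library) as the image-text scorer purely for illustration; the model choice, the example goal, and the image paths are assumptions, not the paper's own models or data.

# Sketch of the VGSI multiple-choice format: pick the candidate step image
# that best matches a textual goal. CLIP is used here only as a generic
# image-text similarity scorer; it is not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

goal = "Make a paper airplane"                      # hypothetical goal text
candidates = ["step_a.jpg", "step_b.jpg",
              "step_c.jpg", "step_d.jpg"]           # placeholder image paths
images = [Image.open(path) for path in candidates]

inputs = processor(text=[goal], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_text has shape (1, 4): similarity of the goal to each image
    scores = model(**inputs).logits_per_text
choice = scores.argmax(dim=-1).item()
print(f"Predicted step image: {candidates[choice]}")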

Cited by 4 publications (4 citation statements)
References 25 publications (15 reference statements)

“…Several existing works have also utilized WikiHow for learning to understand task knowledge. Specifically in NLP, textual descriptions from WikiHow have been used for abstractive summarization (Koupaee and Wang, 2018), procedural understanding (Tandon et al., 2020), and intent estimation (Zhang et al., 2020a). As WikiHow includes multimodal information for task knowledge, concurrent work also uses it as a resource for understanding visual goals (Yang et al., 2021). We believe our work on utilizing WikiHow as a resource for the sequencing task can help advance towards the goal of comprehensive multimodal procedural understanding.…”
Section: Related Work
confidence: 99%
“…Datasets Related to Future Prediction Task: Three types of datasets have been used in previous research on future prediction tasks: visual-only (Damen et al. 2022, 2018; Li, Liu, and Rehg 2018), text-only (Puig et al. 2018; Lyu, Zhang, and Callison-Burch 2021; Le et al. 2023), and multimedia datasets (Yang et al. 2021b; Tang et al. 2019; Miech et al. 2019b; Xu et al. 2023b). Our dataset distinguishes itself from these previous works in two aspects: (1) compared with visual/text-only datasets, our dataset is a multimedia dataset, comprising video, image, and text descriptions for each instructional step.…”
Section: Related Work
confidence: 99%
“…Script Learning: Scripts (Schank and Abelson, 2013; Feigenbaum et al., 1981; Yang et al., 2021; Zhang et al., 2020b,a) refer to the knowledge of stereotypical event sequences that humans constantly experience and repeat every day. One branch of work in script learning focuses on distilling narrative scripts from news or stories (Chambers and Jurafsky, 2008; Jans et al., 2012; Lee and Goldwasser, 2019), where the scripts are not goal-oriented.…”
Section: Related Work
confidence: 99%