2021
DOI: 10.48550/arxiv.2104.05845
Preprint

Visual Goal-Step Inference using wikiHow

Abstract: Procedural events can often be thought of as a high level goal composed of a sequence of steps. Inferring the sub-sequence of steps of a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. Our task is challenging…
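As an illustration of the task format described in the abstract, the following is a minimal sketch of the four-way multiple-choice setup: given a textual goal, score each of four candidate step images and pick the best match. It uses an off-the-shelf CLIP model (via the Hugging Face transformers library) as the image-text scorer purely for illustration; the model choice, the example goal, and the image paths are assumptions, not the paper's own models or data.

# Sketch of the VGSI multiple-choice format: pick the candidate step image
# that best matches a textual goal. CLIP is used here only as a generic
# image-text similarity scorer; it is not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

goal = "Make a paper airplane"                      # hypothetical goal text
candidates = ["step_a.jpg", "step_b.jpg",
              "step_c.jpg", "step_d.jpg"]           # placeholder image paths
images = [Image.open(path) for path in candidates]

inputs = processor(text=[goal], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_text has shape (1, 4): similarity of the goal to each image
    scores = model(**inputs).logits_per_text
choice = scores.argmax(dim=-1).item()
print(f"Predicted step image: {candidates[choice]}")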

Cited by 4 publications (4 citation statements)
References 25 publications (15 reference statements)

“…Several existing works have also utilized WikiHow for learning to understand task knowledge. Specifically in NLP, textual descriptions from WikiHow have been used for abstractive summarization (Koupaee and Wang, 2018), procedural understanding (Tandon et al., 2020), and intent estimation (Zhang et al., 2020a). As WikiHow includes multimodal information for task knowledge, concurrent work also uses it as a resource for understanding visual goals (Yang et al., 2021). We believe our work on utilizing WikiHow as a resource for the sequencing task can help advance towards the goal of comprehensive multimodal procedural understanding.…”
Section: Related Work
confidence: 99%
“…Datasets Related to Future Prediction Task: Three types of datasets have been used in previous research on future prediction tasks: visual-only (Damen et al. 2022, 2018; Li, Liu, and Rehg 2018), text-only (Puig et al. 2018; Lyu, Zhang, and Callison-Burch 2021; Le et al. 2023), and multimedia datasets (Yang et al. 2021b; Tang et al. 2019; Miech et al. 2019b; Xu et al. 2023b). Our dataset distinguishes itself from these previous works in two aspects: (1) compared with visual/text-only datasets, our dataset is a multimedia dataset, comprising video, image, and text descriptions for each instructional step.…”
Section: Related Work
confidence: 99%
“…Script Learning: Scripts (Schank and Abelson, 2013; Feigenbaum et al., 1981; Yang et al., 2021; Zhang et al., 2020b,a) refer to the knowledge of stereotypical event sequences that humans constantly experience and repeat every day. One branch of work in script learning focuses on distilling narrative scripts from news or stories (Chambers and Jurafsky, 2008; Jans et al., 2012; Lee and Goldwasser, 2019), where the scripts are not goal-oriented.…”
Section: Related Work
confidence: 99%