Proceedings of the First International Workshop on Natural Language Processing Beyond Text 2020
DOI: 10.18653/v1/2020.nlpbt-1.4
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos

Abstract: Instructional videos are often used to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models, with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work is complementary to existing tasks but requires models to produce interpretable structured knowledge in the form …
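The abstract is truncated above, so the paper's actual output schema is not shown here. As a hypothetical illustration only, structured procedural knowledge of the kind the benchmark asks for might be represented as verb–argument records like the minimal Python sketch below; the class, field names, and example values are assumptions for illustration, not the paper's schema.

```python
# Hypothetical illustration only: one possible shape for structured
# procedural knowledge extracted from a cooking video. Field names and
# example values are assumptions, not the benchmark's actual format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcedureStep:
    action: str                                          # cooking verb (predicate)
    arguments: List[str] = field(default_factory=list)   # ingredients / tools

# A toy extraction for "chop the onions and add them to the pan":
steps = [
    ProcedureStep(action="chop", arguments=["onions"]),
    ProcedureStep(action="add", arguments=["onions", "pan"]),
]

for step in steps:
    print(step.action, step.arguments)
```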

Citations: cited by 9 publications (2 citation statements)
References: 49 publications (52 reference statements)
“…On the question of multimodal grounding, the computer vision and natural language processing (NLP) communities have drawn closer together, such that datasets originating in computer vision (e.g., Goyal et al., 2017; Damen et al., 2018; Boggust et al., 2019) now have demonstrated utility as benchmarks for NLP grounding tasks (e.g., Gella and Keller, 2017; Huang et al., 2020; Xu et al., 2020). One such popular challenge is grounding words to actions in images and video (e.g., Radford et al., 2021).…”
Section: Introduction
confidence: 99%