Proceedings of the First International Workshop on Natural Language Processing Beyond Text 2020
DOI: 10.18653/v1/2020.nlpbt-1.4
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos

Abstract: Instructional videos are often used to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models, with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work is complementary to existing tasks but requires models to produce interpretable structured knowledge in the form …
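The abstract is truncated above, so the paper's actual output schema is not shown here. As a hypothetical illustration only, structured procedural knowledge of the kind the benchmark asks for might be represented as verb–argument records like the minimal Python sketch below; the class, field names, and example values are assumptions for illustration, not the paper's schema.

```python
# Hypothetical illustration only: one possible shape for structured
# procedural knowledge extracted from a cooking video. Field names and
# example values are assumptions, not the benchmark's actual format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcedureStep:
    action: str                                          # cooking verb (predicate)
    arguments: List[str] = field(default_factory=list)   # ingredients / tools

# A toy extraction for "chop the onions and add them to the pan":
steps = [
    ProcedureStep(action="chop", arguments=["onions"]),
    ProcedureStep(action="add", arguments=["onions", "pan"]),
]

for step in steps:
    print(step.action, step.arguments)
```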

Citations: cited by 9 publications (2 citation statements)
References: 49 publications (52 reference statements)
“…On the question of multimodal grounding, the computer vision and natural language processing (NLP) communities have drawn closer together, such that datasets originating in computer vision (e.g., Goyal et al., 2017; Damen et al., 2018; Boggust et al., 2019) now have demonstrated utility as benchmarks for NLP grounding tasks (e.g., Gella and Keller, 2017; Huang et al., 2020; Xu et al., 2020). One such popular challenge is grounding words to actions in images and video (e.g., Radford et al., 2021).…”
Section: Introduction
confidence: 99%