Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1114

Mise en Place: Unsupervised Interpretation of Instructional Recipes

Abstract: We present an unsupervised hard EM approach to automatically mapping instructional recipes to action graphs, which define what actions should be performed on which objects and in what order. Recovering such structures can be challenging due to unique properties of procedural language where, for example, verbal arguments are commonly elided when they can be inferred from context and disambiguation often requires world knowledge. Our probabilistic model incorporates aspects of procedural semantics and world knowledge […]
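
The abstract describes alternating between hard latent assignments (which earlier action each argument refers to) and parameter re-estimation. The following is a minimal hard-EM sketch of that idea only; the data representation, the scoring function, and every name in it are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a hard-EM loop for inducing a recipe "action graph":
# every argument of every step is linked either to a raw ingredient or to the
# output of an earlier step. All names, data structures, and the scoring model
# are illustrative assumptions, not the paper's actual implementation.
import random
from collections import defaultdict

def candidate_origins(step_idx):
    """An argument at step i may come from the raw ingredient list
    (encoded as -1) or from the output of any earlier step."""
    return [-1] + list(range(step_idx))

def score(verb, origin, counts, smoothing=0.1):
    """Toy score for linking an argument of `verb` to `origin`:
    a smoothed co-occurrence count of the verb with the origin type."""
    origin_type = "raw" if origin == -1 else "output"
    return counts[(verb, origin_type)] + smoothing

def hard_em(recipes, iterations=10):
    # recipes: list of recipes, each a list of (verb, argument_text) steps.
    counts = defaultdict(float)
    # Random hard initialization of the latent origin of every argument.
    graphs = [[random.choice(candidate_origins(i)) for i in range(len(r))]
              for r in recipes]
    for _ in range(iterations):
        # M-step: re-estimate counts from the current hard assignments.
        counts.clear()
        for recipe, graph in zip(recipes, graphs):
            for (verb, _), origin in zip(recipe, graph):
                origin_type = "raw" if origin == -1 else "output"
                counts[(verb, origin_type)] += 1.0
        # Hard E-step: assign each argument its single best-scoring origin.
        for r_idx, recipe in enumerate(recipes):
            for i, (verb, _) in enumerate(recipe):
                graphs[r_idx][i] = max(candidate_origins(i),
                                       key=lambda o: score(verb, o, counts))
    return graphs, counts

if __name__ == "__main__":
    toy = [[("mix", "flour and sugar"), ("bake", "the mixture")],
           [("chop", "onions"), ("fry", "them")]]
    graphs, _ = hard_em(toy)
    # With a scorer this small the result depends on the random initialization;
    # the paper's model uses far richer procedural and world-knowledge features.
    print(graphs)
```

The point of the sketch is only the alternation itself: hard argmax assignments in the E-step, count-based re-estimation in the M-step.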

Cited by 78 publications (87 citation statements)
References 22 publications
“…We verify our approach using unstructured instructional videos readily available on YouTube [35]. By jointly optimizing on over two thousand YouTube instructional videos with no reference annotation, our joint visual-linguistic model improves 9% on both the precision and recall of reference resolution over the state-of-the-art linguistic-only model [23]. We further show that resolving reference is important to aligning unstructured speech transcriptions to videos, which are usually not perfectly aligned.…”
Section: Introduction (mentioning)
confidence: 76%
“…Localizing Video Segments with Natural Language. Prior work has considered aligning natural language with video, e.g., instructional videos with transcribed text (Kiddon et al., 2015; Huang et al., 2017; Malmaud et al., 2014, 2015). Our work is most related to recent work in video moment retrieval with natural language (Gao et al., 2017; Hendricks et al., 2017).…”
Section: Related Work (mentioning)
confidence: 99%
“…Finally, our teachers can be seen as rewarding generators that approximate script patterns in recipes. Previous work in learning script knowledge (Schank and Abelson, 1975) has focused on extracting scripts from long texts (Chambers and Jurafsky, 2009; Pichotta and Mooney, 2016), with some of that work focusing on recipes (Kiddon et al., 2015; Mori et al., 2014, 2012). Our teachers implicitly learn this script knowledge and reward recipe generators for exhibiting it.…”
Section: Related Work (mentioning)
confidence: 99%