2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.509
Unsupervised Semantic Parsing of Video Collections

Abstract: Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, an ending, and certain objective steps between them. In this paper, we consider instructional videos, of which there are tens of millions on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. Our method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish t…

Cited by 95 publications (92 citation statements) · References 52 publications (70 reference statements)
“…However, as opposed to our work, these works typically extract from transcriptions only a small number of predefined labels. Numerous datasets of web instructional videos have been proposed over the past years [2,30,45,47,50,67,68]. Among the first to harvest instructional videos, Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube.…”
Section: Related Work
confidence: 99%
“…Numerous datasets of web instructional videos have been proposed over the past years [2,30,45,47,50,67,68]. Among the first to harvest instructional videos, Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube. In a similar vein, the COIN [50] and CrossTask [68] datasets are collected by first searching for tasks on WikiHow and then videos for each task on YouTube.…”
Section: Related Work
confidence: 99%
“…Among them, instructional videos provide more intuitive visual examples, and are the focus of this paper. With the explosion of video data on the Internet, people around the world have uploaded and watched a substantial number of instructional videos [6], [59], covering miscellaneous categories. As suggested by scientists in educational psychology [54], novices often face difficulties in learning from the whole realistic task, and it is necessary to divide the whole task into smaller segments or steps as a form of simplification.…”
Section: Introduction
confidence: 99%
“…Accordingly, a variety of related tasks have been studied by the modern computer vision community in recent years (e.g., temporal action localization [74], [80], video summarization [23], [49], [79], and video captioning [35], [77], [83], etc). Also, increasing efforts have been devoted to exploring different challenges of instructional video analysis [6], [31], [59], [82]. As evidence, Fig. 2 shows the growing number of publications in the top venues over the past ten years.…”
Section: Introduction
confidence: 99%
“…There are also cases where no ASR tokens are available at all. Despite these potential difficulties, previous work has demonstrated that ASR can be informative in a variety of instructional video understanding tasks (Naim et al., 2014, 2015; Malmaud et al., 2015; Sener et al., 2015; Alayrac et al., 2016); though less work has focused on instructional caption generation, which is known to be difficult and sensitive to input perturbations (Chen et al., 2018).…”
Section: Introduction
confidence: 99%