2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.509
Unsupervised Semantic Parsing of Video Collections

Abstract: Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, an ending, and certain objective steps between them. In this paper, we consider instructional videos, of which there are tens of millions on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. Our method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish t…

Cited by 95 publications (92 citation statements) · References 52 publications (70 reference statements)
“…However, as opposed to our work, these works typically extract from transcriptions only a small number of predefined labels. Numerous datasets of web instructional videos have been proposed over the past years [2,30,45,47,50,67,68]. Among the first to harvest instructional videos, Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube.…”
Section: Related Work
confidence: 99%
“…Numerous datasets of web instructional videos have been proposed over the past years [2,30,45,47,50,67,68]. Among the first to harvest instructional videos, Sener et al. [47] use WikiHow, an encyclopedia of how-to articles, to collect 17 popular physical tasks, and obtain videos by querying these tasks on YouTube. In a similar vein, the COIN [50] and CrossTask [68] datasets are collected by first searching for tasks on WikiHow and then videos for each task on YouTube.…”
Section: Related Work
confidence: 99%
“…Among them, instructional videos provide more intuitive visual examples, and are the focus of this paper. With the explosion of video data on the Internet, people around the world have uploaded and watched a substantial number of instructional videos [6], [59], covering miscellaneous categories. As suggested by scientists in educational psychology [54], novices often face difficulties in learning from the whole realistic task, and it is necessary to divide the whole task into smaller segments or steps as a form of simplification.…”
Section: Introduction
confidence: 99%
“…Accordingly, a variety of related tasks have been studied by the modern computer vision community in recent years (e.g., temporal action localization [74], [80], video summarization [23], [49], [79], and video captioning [35], [77], [83], etc). Also, increasing efforts have been devoted to exploring different challenges of instructional video analysis [6], [31], [59], [82]. As evidence, Fig. 2 shows the growing number of publications in the top venues over the past ten years.…”
Section: Introduction
confidence: 99%
“…There are also cases where no ASR tokens are available at all. Despite these potential difficulties, previous work has demonstrated that ASR can be informative in a variety of instructional video understanding tasks (Naim et al., 2014, 2015; Malmaud et al., 2015; Sener et al., 2015; Alayrac et al., 2016); though less work has focused on instructional caption generation, which is known to be difficult and sensitive to input perturbations (Chen et al., 2018).…”
Section: Introduction
confidence: 99%