2008
DOI: 10.1007/978-3-540-88693-8_12

Movie/Script: Alignment and Parsing of Video and Text Transcription

Abstract: Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales "in the wild". Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly-varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of sh…
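The abstract describes aligning a screenplay (which has speaker and scene structure but no timing) with closed captions (which have timestamps but little structure). As a rough, hypothetical sketch of that alignment step, the Python below performs a global dynamic-programming alignment of script dialogue lines against subtitle entries; the word-overlap similarity, the `gap_cost` penalty, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity between a script line and a subtitle line."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def align(script_lines, subtitles, gap_cost=0.5):
    """Global alignment (Needleman-Wunsch style) of script dialogue lines
    against (start_time, end_time, text) subtitle entries.
    Returns a list of (script_index, subtitle_index) matched pairs."""
    n, m = len(script_lines), len(subtitles)
    # score[i][j]: best score aligning the first i script lines to the first j subs
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best, move = float("-inf"), None
            if i > 0 and j > 0:
                s = score[i - 1][j - 1] + similarity(script_lines[i - 1],
                                                     subtitles[j - 1][2])
                if s > best:
                    best, move = s, "match"
            if i > 0 and score[i - 1][j] - gap_cost > best:
                best, move = score[i - 1][j] - gap_cost, "skip_script"
            if j > 0 and score[i][j - 1] - gap_cost > best:
                best, move = score[i][j - 1] - gap_cost, "skip_sub"
            score[i][j], back[i][j] = best, move
    # Trace back from (n, m) to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_script":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```

Once a script line is matched to a subtitle, it can inherit that subtitle's timestamps, which is what localizes the screenplay's surrounding scene and action descriptions in the video.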

Cited by 98 publications (81 citation statements)
References 14 publications (20 reference statements)
“…Typical mistakes contained in scripts marked in red italic (Cour et al. 2008; Duchenne et al. 2009; Laptev et al. 2008; Liang et al. 2011; Marszalek et al. 2009), but so far not for video description. The main reason for this is that automatic alignment frequently fails due to the discrepancy between the movie and the script.…”
Section: Fig. (mentioning)
confidence: 99%
“…Some studies also considered dynamic scenes. [2] studied the alignment of screenplays and videos, [15] learned and recognized simple human movement actions in movies, and [10] studied how to automatically label videos using a compositional model based on AND-OR graphs that was trained on the highly structured domain of baseball videos. The work of [5] attempts to “generate” sentences by first learning from a set of human-annotated examples and producing the same sentence if both images and sentence share common properties in terms of their triplets (Nouns-Verbs-Scenes). No attempt was made to generate novel sentences from images beyond what has been annotated by humans.…”
Section: Related Work (mentioning)
confidence: 99%
“…Such a system can visually discover which actions are performed and also makes it possible to collect training data for action recognition. Following recent advances in action recognition in realistic videos [5,15,16,18], we use movies and their transcripts to obtain video samples of visual actions. Related work by Cour et al. [5] focuses on temporal segmentation of TV series into a hierarchy of shots, threads and scenes and on character naming, while in [15,16,18] the authors address the task of action classification.…”
Section: Introduction (mentioning)
confidence: 99%
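The last statement describes harvesting video samples of actions from movies and their transcripts. As a hypothetical follow-on to the alignment sketch above, the snippet below turns aligned (script line, subtitle) pairs into keyword-labeled video intervals; `harvest` and the keyword filter are illustrative assumptions, not a method from the cited papers.

```python
# Hypothetical harvesting step built on the alignment sketch above: once a
# script line inherits its matched subtitle's timestamps, any line whose text
# matches an action keyword yields a labeled video interval. In practice the
# actions of interest live in stage directions near the matched dialogue;
# the matched line itself is used here for brevity.
def harvest(script_lines, subtitles, pairs, keyword):
    clips = []
    for si, ti in pairs:
        if keyword in script_lines[si].lower():
            start, end, _ = subtitles[ti]
            clips.append((start, end, script_lines[si]))
    return clips

# Example: clips = harvest(script, subs, align(script, subs), "sits down")
```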