2008
DOI: 10.1007/978-3-540-88693-8_12

Movie/Script: Alignment and Parsing of Video and Text Transcription

Abstract: Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales "in the wild". Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly-varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of sh…
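The abstract describes aligning a screenplay (which has speaker and scene structure but no timing) with closed captions (which have timestamps but little structure). As a rough, hypothetical sketch of that alignment step, the Python below performs a global dynamic-programming alignment of script dialogue lines against subtitle entries; the word-overlap similarity, the `gap_cost` penalty, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity between a script line and a subtitle line."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def align(script_lines, subtitles, gap_cost=0.5):
    """Global alignment (Needleman-Wunsch style) of script dialogue lines
    against (start_time, end_time, text) subtitle entries.
    Returns a list of (script_index, subtitle_index) matched pairs."""
    n, m = len(script_lines), len(subtitles)
    # score[i][j]: best score aligning the first i script lines to the first j subs
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best, move = float("-inf"), None
            if i > 0 and j > 0:
                s = score[i - 1][j - 1] + similarity(script_lines[i - 1],
                                                     subtitles[j - 1][2])
                if s > best:
                    best, move = s, "match"
            if i > 0 and score[i - 1][j] - gap_cost > best:
                best, move = score[i - 1][j] - gap_cost, "skip_script"
            if j > 0 and score[i][j - 1] - gap_cost > best:
                best, move = score[i][j - 1] - gap_cost, "skip_sub"
            score[i][j], back[i][j] = best, move
    # Trace back from (n, m) to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_script":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```

Once a script line is matched to a subtitle, it can inherit that subtitle's timestamps, which is what localizes the screenplay's surrounding scene and action descriptions in the video.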

Cited by 98 publications (81 citation statements)
References 14 publications (20 reference statements)
“…Typical mistakes contained in scripts marked in red italic (Cour et al. 2008; Duchenne et al. 2009; Laptev et al. 2008; Liang et al. 2011; Marszalek et al. 2009), but so far not for video description. The main reason for this is that automatic alignment frequently fails due to the discrepancy between the movie and the script.…”
Section: Fig. (mentioning)
confidence: 99%
“…Some studies also considered dynamic scenes. [2] studied the alignment of screenplays and videos, [15] learned and recognized simple human movement actions in movies, and [10] studied how to automatically label videos using a compositional model based on AND-OR graphs that was trained on the highly structured domain of baseball videos. The work of [5] attempts to “generate” sentences by first learning from a set of human-annotated examples and producing the same sentence if both images and sentence share common properties in terms of their triplets (Nouns-Verbs-Scenes). No attempt was made to generate novel sentences from images beyond what has been annotated by humans.…”
Section: Related Work (mentioning)
confidence: 99%
“…Such a system can visually discover which actions are performed and also makes it possible to collect training data for action recognition. Following recent advances in action recognition in realistic videos [5,15,16,18], we use movies and their transcripts to obtain video samples of visual actions. Related work by Cour et al. [5] focuses on temporal segmentation of TV series into a hierarchy of shots, threads and scenes and on character naming, while in [15,16,18] the authors address the task of action classification.…”
Section: Introduction (mentioning)
confidence: 99%
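The last statement describes harvesting video samples of actions from movies and their transcripts. As a hypothetical follow-on to the alignment sketch above, the snippet below turns aligned (script line, subtitle) pairs into keyword-labeled video intervals; `harvest` and the keyword filter are illustrative assumptions, not a method from the cited papers.

```python
# Hypothetical harvesting step built on the alignment sketch above: once a
# script line inherits its matched subtitle's timestamps, any line whose text
# matches an action keyword yields a labeled video interval. In practice the
# actions of interest live in stage directions near the matched dialogue;
# the matched line itself is used here for brevity.
def harvest(script_lines, subtitles, pairs, keyword):
    clips = []
    for si, ti in pairs:
        if keyword in script_lines[si].lower():
            start, end, _ = subtitles[ti]
            clips.append((start, end, script_lines[si]))
    return clips

# Example: clips = harvest(script, subs, align(script, subs), "sits down")
```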