2014
DOI: 10.1007/978-3-319-11752-2_15
Coherent Multi-sentence Video Description with Variable Level of Detail

Cited by 177 publications (161 citation statements: 1 supporting, 160 mentioning, 0 contrasting)
References 18 publications
“…While recently several large datasets have been released which provide images with descriptions (Young et al 2014; Lin et al 2014; Ordonez et al 2011), video description datasets focus on short video clips with single sentence descriptions and have a limited number of video clips (Xu et al 2016; Chen and Dolan 2011) or are not publicly available (Over et al 2012). TACoS Multi-Level (Rohrbach et al 2014) and YouCook (Das et al 2013) are exceptions as they provide multiple sentence descriptions and longer videos. While these corpora pose challenges in terms of fine-grained recognition, they are restricted to the cooking scenario.…”
Section: Script (mentioning)
confidence: 99%
“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al 2012; Das et al 2013; Guadarrama et al 2013; Rohrbach et al 2014; Venugopalan et al 2015; Yu et al 2015, 2016), where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”
Section: Fig (mentioning)
confidence: 99%
“…To begin with, the image content was represented by features, such as colour information (33, 41, 51), texture (41, 51), and detected edges (33). Image features were then replaced with an abstract representation, essentially a set of description words based on a visual-to-textual representation dictionary.…”
Section: Related Work (mentioning)
confidence: 99%
“…Image features were then replaced with an abstract representation, essentially a set of description words based on a visual-to-textual representation dictionary. For certain applications, objects were detected and recognised using some prior knowledge to supply higher-level features (33, 41, 51, 1). Thomason et al (41) presented descriptions of objects based on semantic relations.…”
Section: Related Work (mentioning)
confidence: 99%