2014
DOI: 10.1007/978-3-319-11752-2_15
Coherent Multi-sentence Video Description with Variable Level of Detail

Cited by 177 publications (161 citation statements: 1 supporting, 160 mentioning, 0 contrasting)
References 18 publications
“…While recently several large datasets have been released which provide images with descriptions (Young et al 2014; Lin et al 2014; Ordonez et al 2011), video description datasets focus on short video clips with single sentence descriptions and have a limited number of video clips (Xu et al 2016; Chen and Dolan 2011) or are not publicly available (Over et al 2012). TACoS Multi-Level (Rohrbach et al 2014) and YouCook (Das et al 2013) are exceptions as they provide multiple sentence descriptions and longer videos. While these corpora pose challenges in terms of fine-grained recognition, they are restricted to the cooking scenario.…”
Section: Script (mentioning)
confidence: 99%
“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al 2012; Das et al 2013; Guadarrama et al 2013; Rohrbach et al 2014; Venugopalan et al 2015; Yu et al 2015, 2016), where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”
Section: Fig (mentioning)
confidence: 99%
“…To begin with, the image content was represented by features, such as colour information (33, 41, 51), texture (41, 51), and detected edges (33). Image features were then replaced with an abstract representation, essentially a set of description words based on a visual-to-textual representation dictionary.…”
Section: Related Work (mentioning)
confidence: 99%
“…Image features were then replaced with an abstract representation, essentially a set of description words based on a visual-to-textual representation dictionary. For certain applications, objects were detected and recognised using some prior knowledge to supply higher-level features (33, 41, 51, 1). Thomason et al (41) presented descriptions of objects based on semantic relations.…”
Section: Related Work (mentioning)
confidence: 99%