2018
DOI: 10.1007/978-3-030-01252-6_29

Move Forward and Tell: A Progressive Generator of Video Descriptions

Abstract: We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption conditioned on a single embedding. On the contrary, we consider videos with rich temporal structures and aim to generate paragraph descriptions that can preserve the story flow while being coherent and concise. Towards this goal, we propose a new approach, which produces a des…

Cited by 98 publications (94 citation statements)
References 36 publications (70 reference statements)
“…For a fair comparison, we use exactly the same frame-wise feature from this work for our temporal attention module. For video paragraph description, we compare our methods against the SotA method MFT [31] with the evaluation script provided by the authors [31]. For image captioning, we compare against two SotA methods, Neural Baby Talk (NBT) [16] and BUTD [1].…”
Section: Compared Methods and Metrics (mentioning)
confidence: 99%
“…[Table header: Recall (@tIoU) and Precision (@tIoU), each at @0.3, @0.5, @0.7, @0.9, and Average] MFT [30] 46. 18 respectively.…”
Section: Methods (mentioning)
confidence: 99%
“…Proposal the results of MFT [30], which is originally proposed for video paragraph generation but its event selection module is also able to generate an event sequence from the candidate event proposals; it makes a choice between selecting each proposal for caption generation and skipping it, and constructs an event sequence implicitly. For MFT, we compare performances in both event detection and dense captioning.…”
Section: Methods (mentioning)
confidence: 99%