Move Forward and Tell: A Progressive Generator of Video Descriptions

Xiong, Yilei; Dai, Bo; Lin, Dahua

doi:10.1007/978-3-030-01252-6_29

Cited by 98 publications

(94 citation statements)

References 36 publications

(70 reference statements)


“…For a fair comparison, we use exactly the same frame-wise feature from this work for our temporal attention module. For video paragraph description, we compare our methods against the SotA method MFT [31] with the evaluation script provided by the authors [31]. For image captioning, we compare against two SotA methods, Neural Baby Talk (NBT) [16] and BUTD [1].…”

Section: Compared Methods and Metrics (mentioning)

confidence: 99%

Grounded Video Description

Zhou

Kalantidis

Chen

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

174

155


Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluate how grounded or "true" such model are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate our generated sentences are better grounded in the video.


“…[Table residue: Recall and Precision at tIoU thresholds 0.3, 0.5, 0.7, and 0.9, plus their averages, with a row for MFT [30]; the numeric values are truncated in the source.]…”

Section: Methods (mentioning)

confidence: 99%

“…Proposal the results of MFT [30], which is originally proposed for video paragraph generation but its event selection module is also able to generate an event sequence from the candidate event proposals; it makes a choice between selecting each proposal for caption generation and skipping it, and constructs an event sequence implicitly. For MFT, we compare performances in both event detection and dense captioning.…”

Section: Methods (mentioning)

confidence: 99%

“…Despite their impressive performances, they are limited to describing a video using a single sentence and can be applied only to a short video containing a single event. Thus, Yu et al [35] propose a hierarchical recurrent neural network to generate a paragraph for a long video, while Xiong et al [30] introduce a paragraph generation method based on event proposals, where an event selection module determines which proposals need to be utilized for caption generation in a progressive way. Contrary to these tasks, which simply generate a sentence or paragraph for an input video, dense video captioning requires localizing and describing events at the same time.…”

Section: Video Captioning (mentioning)

confidence: 99%

“…Understanding video contents is an important topic in computer vision. Through the introduction of large-scale datasets [9,31] and the recent advances of deep learning technology, research towards video content understanding is no longer limited to activity classification or detection and addresses more complex tasks including video caption generation [1,4,13,14,15,22,23,26,28,30,33,35,36].…”

Section: Introduction (mentioning)

confidence: 99%


Streamlined Dense Video Captioning

Mun

Yang

Ren

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

124


Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning on a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both event and episode levels, for better context modeling. The proposed technique achieves outstanding performances on ActivityNet Captions dataset in most metrics.


Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

Wajid

Terashima‐Marin

Najafirad

et al. 2023

Engineering Reports


Generating an image/video caption has always been a fundamental problem of Artificial Intelligence, which is usually performed using the potential of Deep Learning Methods, Computer Vision, Knowledge Graphs, and Natural Language Processing (NLP). The significant task of image/video captioning is to describe visual content in terms of natural language. Due to a semantic gap, this presents a massive problem in understanding and explaining images or videos syntactically and semantically. The current systems need somewhere to fill the gap between low‐level and high‐level features while mapping. Therefore, to tackle this problem, there is a need to describe the latest research and methods to overcome difficulties and to propose effective solutions. This work thoroughly analyses and investigates the most related methods (deep learning and knowledge graph‐based approaches), benchmark datasets, and evaluation metrics with their benefits and limitations. Here we have also reviewed the state‐of‐the‐art methods related to image/video captioning and their applications in the current scenario. Finally, we provide thorough information on existing research with comparisons of results on benchmark datasets. We have also mentioned the existing challenges and future direction of research.

