2020
DOI: 10.1109/tpami.2019.2920899

Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

Abstract: In this paper, the problem of describing the visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work, which mainly exploits cues from video contents to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder component makes use of the forward flow to …
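To make the two flows concrete, here is a minimal PyTorch sketch of an encoder-decoder-reconstructor in the spirit of RecNet. All module names, dimensions, and the mean-pooled reconstruction target are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of an encoder-decoder-reconstructor for video captioning.
# All names, dimensions, and the mean-pooled "global" reconstruction target
# are illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class RecNetSketch(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.encoder = nn.Linear(feat_dim, hid_dim)        # project frame features
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.word_out = nn.Linear(hid_dim, vocab_size)
        # Backward flow: reconstruct the video representation from decoder states.
        self.reconstructor = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.feat_out = nn.Linear(hid_dim, feat_dim)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids.
        video_ctx = self.encoder(frame_feats).mean(dim=1)  # (B, hid_dim)
        words = self.embed(captions)                       # (B, T_words, hid_dim)
        # Condition the decoder on the video context via its initial state.
        h0 = video_ctx.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_states, _ = self.decoder(words, (h0, c0))      # forward flow
        logits = self.word_out(dec_states)
        # Backward flow: run the reconstructor over decoder hidden states and
        # regress a "global" video feature (here: mean-pooled frame features).
        rec_states, _ = self.reconstructor(dec_states)
        rec_feat = self.feat_out(rec_states.mean(dim=1))
        return logits, rec_feat


def recnet_loss(logits, rec_feat, captions, frame_feats, lam=0.2):
    # Joint objective: cross-entropy on next-word prediction plus a
    # weighted reconstruction loss on the video representation.
    ce = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        captions[:, 1:].reshape(-1))
    rec = nn.functional.mse_loss(rec_feat, frame_feats.mean(dim=1))
    return ce + lam * rec
```

The reconstruction term acts as a regularizer: the decoder's hidden states must retain enough visual information to regenerate the video representation, which in turn grounds the generated sentence in the video content.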

Cited by 61 publications (19 citation statements)
References 62 publications
“…J. Mun et al. [74] proposed a framework in which an event sequence generation network monitors the series of events for the captions generated from the video. W. Zhang et al. [117] proposed a reconstruction network for describing visual contents, which operates on both the forward flow (from video to sentence) and the backward flow (from sentence to video). W. Xu et al. [108] proposed a polishing network that utilizes the RL technique to refine the generated captions.…”
Section: Deep Reinforcement Learning (DRL) Architectures (mentioning)
confidence: 99%
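As a concrete illustration of RL-based caption refinement such as the polishing approach mentioned above, a common recipe is self-critical REINFORCE, where a sampled caption is rewarded relative to a greedy baseline. The sketch below assumes hypothetical `model.sample`, `model.greedy`, and `cider_score` interfaces; none of these come from the cited papers.

```python
# Sketch of self-critical REINFORCE fine-tuning for a captioning model.
# `model.sample`, `model.greedy`, and `cider_score` are assumed interfaces,
# not APIs of any specific library or of the cited papers.
import torch


def scst_step(model, frame_feats, refs, cider_score, optimizer):
    # Sample a caption per video and record per-token log-probabilities.
    sampled_ids, log_probs = model.sample(frame_feats)   # stochastic decode
    with torch.no_grad():
        greedy_ids = model.greedy(frame_feats)           # baseline decode
    # Reward each sample relative to the greedy baseline (variance reduction).
    reward = torch.tensor(
        [cider_score(s, r) - cider_score(g, r)
         for s, g, r in zip(sampled_ids, greedy_ids, refs)],
        device=log_probs.device)
    # REINFORCE: raise the log-probability of captions that beat the baseline.
    loss = -(reward.unsqueeze(1) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```

Using the greedy decode of the same model as the reward baseline avoids training a separate value network and directly optimizes a sentence-level metric such as CIDEr, which token-level cross-entropy cannot do.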
“…This framework is trained in a supervised manner, while RL further enhances the model for better context modeling. W. Zhang et al. [177] proposed a reconstruction network for describing visual contents, which operates on both the forward flow (from video to sentence) and the backward flow (from sentence to video). The encoder-decoder utilizes the forward flow for text description, and the reconstructor (local and global) utilizes the backward flow.…”
Section: Deep Reinforcement Learning (DRL) Architectures (mentioning)
confidence: 99%
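The local reconstructor mentioned in the statement above can be pictured as attending over the decoder's hidden states to regress each frame feature individually, while the global variant regresses a single pooled video representation (as in the earlier sketch). The following is an assumed illustration of the local idea, not the paper's code.

```python
# Sketch of a "local" reconstructor: attend over decoder hidden states to
# regress each frame feature individually. Names and dimensions are assumed.
import torch
import torch.nn as nn


class LocalReconstructor(nn.Module):
    def __init__(self, hid_dim=512, feat_dim=2048, n_frames=28):
        super().__init__()
        # One learned query per frame position, used to attend over the
        # caption decoder's hidden states.
        self.queries = nn.Parameter(torch.randn(n_frames, hid_dim))
        self.feat_out = nn.Linear(hid_dim, feat_dim)

    def forward(self, dec_states):
        # dec_states: (B, T_words, hid_dim) hidden states of the caption decoder.
        attn = torch.einsum("qh,bth->bqt", self.queries, dec_states)
        attn = attn.softmax(dim=-1)            # (B, n_frames, T_words)
        ctx = torch.bmm(attn, dec_states)      # (B, n_frames, hid_dim)
        return self.feat_out(ctx)              # (B, n_frames, feat_dim)


def local_rec_loss(rec_frames, frame_feats):
    # Frame-wise reconstruction loss against the original frame features.
    return nn.functional.mse_loss(rec_frames, frame_feats)
```

Compared with the global variant, the frame-wise target forces the generated sentence to account for temporal detail rather than only an averaged summary of the video.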
“…Early works [10,14,25,26,34,39,45] on video captioning adopted template-based methods, which first define specific rules for language grammar and then generate captions by associating words detected from visual contents with predefined sentence templates. More recently, with the development of deep neural networks, sequence learning methods with an encoder-decoder architecture [31] have been widely adopted for video captioning.…”
Section: Related Work, 2.1 Video Captioning (mentioning)
confidence: 99%
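For contrast with the encoder-decoder approach, a template-based captioner reduces to slot filling over detector outputs. The toy sketch below hard-codes the detections that a real visual detector would supply.

```python
# Toy illustration of template-based captioning: a detected
# subject-verb-object triple is slotted into a fixed sentence pattern.
# The detection dict stands in for the output of a real visual detector.
def template_caption(detections):
    template = "A {subject} is {verb} a {object}."
    return template.format(**detections)


print(template_caption({"subject": "man", "verb": "riding", "object": "horse"}))
# -> "A man is riding a horse."
```

The rigidity of such templates is exactly what motivated the shift to learned encoder-decoder models, which can produce free-form sentences instead of fixed patterns.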