2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018
DOI: 10.1109/cvpr.2018.00443
|View full text |Cite
|
Sign up to set email alerts
|

Video Captioning via Hierarchical Reinforcement Learning

Abstract: Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a highlevel Mana… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

0
133
0
1

Year Published

2018
2018
2023
2023

Publication Types

Select...
4
3
2

Relationship

1
8

Authors

Journals

citations
Cited by 219 publications
(134 citation statements)
references
References 47 publications
0
133
0
1
Order By: Relevance
“…Initial work has already demonstrated the benefits of combining reinforcement learning with RNNs to play Atari ® games 145 . Promising results have also been obtained for visual tracking, 146,147 face recognition, 148 action recognition, 149,150 video captioning, 151 color enhancement, 152 and object detection 153,154 …”
Section: The Role Of Recurrence Beyond Recognitionmentioning
confidence: 99%
See 1 more Smart Citation
“…Initial work has already demonstrated the benefits of combining reinforcement learning with RNNs to play Atari ® games 145 . Promising results have also been obtained for visual tracking, 146,147 face recognition, 148 action recognition, 149,150 video captioning, 151 color enhancement, 152 and object detection 153,154 …”
Section: The Role Of Recurrence Beyond Recognitionmentioning
confidence: 99%
“…Initial work has already demonstrated the benefits of combining reinforcement learning with RNNs to play Atari R games. 145 Promising results have also been obtained for visual tracking, 146,147 face recognition, 148 action recognition, 149,150 video captioning, 151 color enhancement, 152 and object detection. 153,154 Another approach to learning structure in the visual world, which does not use explicit labeled examples or a teacher and provides direct rewards/punishment for specific actions, is based on the intuition that predicting what will happen next may be an important principle of computation in the brain.…”
Section: Learning and Plasticitymentioning
confidence: 99%
“…Vision-and-Language Grounding There is much prior work in the intersection of computer vision and natural language processing [42,23,27,21]. A highly related class of tasks centers around generating captions for images and videos [12,13,37,38,44]. In Visual Question Answering [3,43] and Visual Dialog [9], models generate single-turn and multi-turn responses by co-grounding vision and language.…”
Section: Related Workmentioning
confidence: 99%
“…(Shen et al, 2017;Gan et al, 2017) adopt multi-label learning with weak supervision to extract semantic features of video data. (Wang et al, 2018b) proposes optimizing the metrics directly with hierarchical reinforcement learning. extracts five types of features to develop the multimodal video captioning method and achieves promising results.…”
Section: Video Captioningmentioning
confidence: 99%