Video Captioning via Hierarchical Reinforcement Learning

Wang, Xin; Chen, Wenhu; Wu, Jiawei; Wang, Yuan-Fang; Wang, William Yang

doi:10.1109/cvpr.2018.00443

Cited by 219 publications

(134 citation statements)

References 47 publications

Supporting

Mentioning

133

Contrasting

Unclassified

“…Initial work has already demonstrated the benefits of combining reinforcement learning with RNNs to play Atari® games 145. Promising results have also been obtained for visual tracking, 146,147 face recognition, 148 action recognition, 149,150 video captioning, 151 color enhancement, 152 and object detection 153,154…”

Section: The Role Of Recurrence Beyond Recognition (mentioning)

confidence: 99%

“…Initial work has already demonstrated the benefits of combining reinforcement learning with RNNs to play Atari® games. 145 Promising results have also been obtained for visual tracking, 146,147 face recognition, 148 action recognition, 149,150 video captioning, 151 color enhancement, 152 and object detection. 153,154 Another approach to learning structure in the visual world, which does not use explicit labeled examples or a teacher and provides direct rewards/punishment for specific actions, is based on the intuition that predicting what will happen next may be an important principle of computation in the brain.…”

Section: Learning and Plasticity (mentioning)

confidence: 99%

Beyond the feedforward sweep: feedback computations in the visual cortex

Kreiman, Serre. 2020

Annals of the New York Academy of Sciences

Visual perception involves the rapid formation of a coarse image representation at the onset of visual processing, which is iteratively refined by late computational processes. These early versus late time windows approximately map onto feedforward and feedback processes, respectively. State‐of‐the‐art convolutional neural networks, the main engine behind recent machine vision successes, are feedforward architectures. Their successes and limitations provide critical information regarding which visual tasks can be solved by purely feedforward processes and which require feedback mechanisms. We provide an overview of recent work in cognitive neuroscience and machine vision that highlights the possible role of feedback processes for both visual recognition and beyond. We conclude by discussing important open questions for future research.

“…Vision-and-Language Grounding There is much prior work in the intersection of computer vision and natural language processing [42,23,27,21]. A highly related class of tasks centers around generating captions for images and videos [12,13,37,38,44]. In Visual Question Answering [3,43] and Visual Dialog [9], models generate single-turn and multi-turn responses by co-grounding vision and language.…”

Section: Related Work (mentioning)

confidence: 99%

Transferable Representation Learning in Vision-and-Language Navigation

Huang, Jain, Mehta, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) require machine agents to interpret natural language instructions and learn to act in visually realistic environments to achieve navigation goals. The overall task requires competence in several perception problems: successful agents combine spatio-temporal, vision and language understanding to produce appropriate action sequences. Our approach adapts pre-trained vision and language representations to relevant in-domain tasks making them more effective for VLN. Specifically, the representations are adapted to solve both a cross-modal sequence alignment and sequence coherence task. In the sequence alignment task, the model determines whether an instruction corresponds to a sequence of visual frames. In the sequence coherence task, the model determines whether the perceptual sequences are predictive sequentially in the instruction-conditioned latent space. By transferring the domain-adapted representations, we improve competitive agents in R2R as measured by the success rate weighted by path length (SPL) metric.

“…(Shen et al, 2017; Gan et al, 2017) adopt multi-label learning with weak supervision to extract semantic features of video data. (Wang et al, 2018b) proposes optimizing the metrics directly with hierarchical reinforcement learning. extracts five types of features to develop the multimodal video captioning method and achieves promising results.…”

Section: Video Captioning (mentioning)

confidence: 99%

Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning

Jin, Huang, et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

This paper addresses the challenging task of video captioning which aims to generate descriptions for video data. Recently, the attention-based encoder-decoder structures have been widely used in video captioning. In existing literature, the attention weights are often built from the information of an individual modality, while, the association relationships between multiple modalities are neglected. Motivated by this observation, we propose a video captioning model with High-Order Cross-Modal Attention (HOCA) where the attention weights are calculated based on the high-order correlation tensor to capture the frame-level cross-modal interaction of different modalities sufficiently. Furthermore, we novelly introduce Low-Rank HOCA which adopts tensor decomposition to reduce the extremely large space requirement of HOCA, leading to a practical and efficient implementation in real-world applications. Experimental results on two benchmark datasets, MSVD and MSR-VTT, show that Low-rank HOCA establishes a new state-of-the-art.
