2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.29
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?

Abstract: Much recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we investigate whether this direct approach succeeds due to, or despite, the fact that it avoids the explicit representation of high-level information. We propose a method of incorpora…
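
To make the distinction concrete, the sketch below shows one way an explicit vector of high-level concept (attribute) probabilities could condition an RNN caption decoder instead of raw CNN features. This is a minimal illustration in PyTorch, not the paper's released implementation; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn


class ConceptConditionedCaptioner(nn.Module):
    """Toy sketch: an LSTM caption decoder initialised from an explicit vector
    of high-level concept probabilities (e.g. sigmoid outputs of a multi-label
    attribute classifier). Sizes are illustrative only."""

    def __init__(self, num_concepts=256, vocab_size=10000,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        # Map the concept-probability vector to the LSTM's initial state.
        self.init_h = nn.Linear(num_concepts, hidden_dim)
        self.init_c = nn.Linear(num_concepts, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, concept_probs, captions):
        # concept_probs: (batch, num_concepts); captions: (batch, seq_len) token ids.
        h0 = torch.tanh(self.init_h(concept_probs)).unsqueeze(0)
        c0 = torch.tanh(self.init_c(concept_probs)).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)           # per-step vocabulary logits


if __name__ == "__main__":
    model = ConceptConditionedCaptioner()
    probs = torch.rand(2, 256)            # stand-in attribute probabilities
    tokens = torch.randint(0, 10000, (2, 12))
    print(model(probs, tokens).shape)     # torch.Size([2, 12, 10000])
```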

Cited by 395 publications (285 citation statements). References 38 publications (86 reference statements).
“…In [10,42], the authors introduced a framework that uses a pre-trained CNN as an encoder to extract image features, followed by an RNN as a decoder to generate image descriptions. This model was further improved by incorporating high-level semantic attribute information [44,49] or by regularizing the RNN decoder [6]. To distill the salient objects or important regions from an image, different kinds of attention mechanisms were integrated into the captioning framework to examine the relevant image regions when generating sentences [2,26,45,47,50].…”
Section: Related Work
Mentioning confidence: 99%
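
For readers unfamiliar with the pipeline this excerpt describes, the following is a minimal PyTorch sketch of a generic CNN-encoder / RNN-decoder captioner with soft attention over spatial regions. Module names and dimensions are assumptions for illustration, not the cited papers' implementations.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AttentiveCaptioner(nn.Module):
    """Toy CNN-encoder / RNN-decoder captioner with soft attention.

    Layer sizes and names are illustrative, not taken from the cited papers.
    """

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # Backbone truncated before pooling, so we keep a grid of region
        # features, e.g. (B, 2048, 7, 7) for 224x224 ResNet-50 input.
        backbone = models.resnet50(weights=None)  # pass real weights in practice
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                                   # (B, C, H, W)
        b, c_dim, h_dim, w_dim = feats.shape
        regions = feats.view(b, c_dim, h_dim * w_dim).transpose(1, 2)  # (B, R, C)
        h = feats.new_zeros(b, self.cell.hidden_size)
        c = feats.new_zeros(b, self.cell.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            # Soft attention: score every region against the decoder state.
            state = h.unsqueeze(1).expand(-1, regions.size(1), -1)
            alpha = torch.softmax(self.attn(torch.cat([regions, state], -1)), dim=1)
            context = (alpha * regions).sum(dim=1)         # attended image feature
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.cell(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, T, vocab_size)
```
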
“…Automatic image captioning has drawn great attention in recent years [8–21]. Karpathy and Li [8] proposed a system to provide natural language descriptions for image regions.…”
Section: Image Description Generation
Mentioning confidence: 99%
“…Recent work seems to suggest that, in the end-to-end learning framework, using posterior distributions over a refined set of object classes (relevant to captions) performs better than using lower-level dense image representations (Wu et al., 2016; You et al., 2016). Vinyals et al. (2016) note that using a better image network (a network that performs better on the image classification task) results in improvements in the generated captions.…”
Section: Studying Visual Representations
Mentioning confidence: 99%
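
As a rough illustration of the comparison mentioned in that excerpt, both kinds of representation can be read off the same classification backbone. The snippet below is a hypothetical example using a stock torchvision ResNet and its 1000-way head, not the cited papers' refined, caption-relevant class vocabularies; it simply contrasts the dense pooled feature with a class-posterior vector.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)    # load real weights in practice
backbone.eval()
images = torch.rand(1, 3, 224, 224)         # stand-in image batch

with torch.no_grad():
    # (a) Lower-level dense representation: the pooled 2048-d feature vector.
    trunk = nn.Sequential(*list(backbone.children())[:-1])
    dense = trunk(images).flatten(1)                        # (1, 2048)

    # (b) Higher-level representation: a posterior over a class set, here just
    # the softmax over the stock 1000-way classifier head.
    posteriors = torch.softmax(backbone(images), dim=-1)    # (1, 1000)

print(dense.shape, posteriors.shape)
```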