CVPR 2011
DOI: 10.1109/cvpr.2011.5995466

Baby talk: Understanding and generating simple image descriptions

Abstract: We posit that visually descriptive language offers computer vision researchers both information about the world, and information about how people describe the world. The potential benefit from this source is made more significant due to the enormous amount of language data easily available today. We present a system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision.…
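The abstract outlines a two-stage design: visual recognition proposes image content (objects, attributes, spatial relations), and language statistics plus templates turn that content into a sentence. The sketch below is a minimal, hypothetical illustration of such a detect-then-template ("content selection and surface realisation") captioner; the function, labels, scores, and sentence template are assumptions for illustration, not the authors' actual system.

```python
# Illustrative sketch only: a template-based captioner in the spirit of the
# abstract. Content selection picks confident detections; surface realisation
# fills a fixed sentence template. All names and values are hypothetical.

def describe(detections, attributes, relations, max_objects=2):
    """
    detections: list of (object_label, score) from an object detector
    attributes: dict object_label -> attribute word (e.g. "brown")
    relations:  dict (label_a, label_b) -> spatial preposition (e.g. "near")
    """
    # Content selection: keep the highest-scoring objects.
    chosen = [lbl for lbl, _ in sorted(detections, key=lambda d: -d[1])[:max_objects]]

    # Noun phrases of the form "the <attribute> <object>".
    nps = [f"the {attributes.get(lbl, '')} {lbl}".replace("  ", " ") for lbl in chosen]

    if len(chosen) == 1:
        return f"There is {nps[0]}."

    # Surface realisation: join the two phrases with a spatial preposition,
    # falling back to plain conjunction when no relation was predicted.
    prep = relations.get((chosen[0], chosen[1]), "and")
    return f"There is {nps[0]} {prep} {nps[1]}."


if __name__ == "__main__":
    dets = [("dog", 0.92), ("sofa", 0.81), ("lamp", 0.40)]
    attrs = {"dog": "brown", "sofa": "red"}
    rels = {("dog", "sofa"): "on"}
    print(describe(dets, attrs, rels))  # There is the brown dog on the red sofa.
```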

Cited by 491 publications (445 citation statements); references 25 publications. Selected citation statements, ordered by relevance:
“…A good caption for such an image is often only loosely related to the content of the image. The setting of this work is therefore different from that in [5][6][7][8][9][10][11][12], where the objective is to generate a caption that describes what is depicted in the image.…”
Section: Related Work (mentioning)
Confidence: 99%
“…However, even with 1 million images, it is unrealistic to expect that every possible query image with various objects and actions can be represented and found in such dataset. In contrast to this caption transfer approach, the work in [6][7][8][9][10][11][12] adopts the conventional content selection and surface realisation approach. Starting from the output of visual processing engines e.g.…”
Section: Related Work (mentioning)
Confidence: 99%
“…These can again be used both for image retrieval and generating short descriptive sentences given an image. Kulkarni et al [6] push this work a step further by generating more complex, natural language descriptions which are able to describe multiple objects, their attributes, and their spatial relations. Our work can be considered as a reverse process of this line of work.…”
Section: Introduction (mentioning)
Confidence: 99%