Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either generation problem or as a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally we extrapolate future directions in the area of automatic image description generation.
Automatic image description generation is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the known approaches based on how they conceptualise this problem and provide a review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image-text datasets and the evaluation measures that have been developed to assess the quality of machine-generated descriptions. Finally we explore future directions in the area of automatic image description. IntroductionThe task of automatic image description involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most salient aspects of the image. This requires the joint use of both Computer Vision (CV) and Natural Language Processing (NLP) techniques.From a CV point of view, the description could in principle cover any visual aspect of the image: it can talk about objects and their attributes, features of the scene (e.g., indoor/outdoor), or verbalize the interaction of the people and objects in the scene. More challenging, it can reference objects that are not depicted (e.g., people waiting for a train, even when the train is not visible because it has not arrived yet) and provide background knowledge that cannot be derived directly from the image (e.g., the person depicted is the Mona Lisa). In other words, image understanding (which essentially produces an unstructured list of object, scene and * This paper is an extended abstract of an article in the Journal of Artificial Intelligence Research [Bernardi et al., 2016].
The context for the work we report here is the automatic description of spatial relationships between pairs of objects in images. We investigate the task of selecting prepositions for such spatial relationships. We describe the two datasets of object pairs and prepositions we have created for English and French, and report results for predicting prepositions for object pairs in both of these languages, using two methods: (a) an existing approach which manually fixes the mapping from geometrical features to prepositions, and (b) a Naive Bayes classifier trained on the English and French datasets. For the latter we use features based on object class labels and geometrical measurements of object bounding boxes. We evaluate the automatically generated prepositions on unseen data in terms of accuracy against the human-selected prepositions.
Detection of spatial relations between objects in images is currently a popular subject in image description research. A range of different language and geometric object features have been used in this context, but methods have not so far used explicit information about the third dimension (depth), except when manually added to annotations. The lack of such information hampers detection of spatial relations that are inherently 3D. In this paper, we use a fully automatic method for creating a depth map of an image and derive several different object-level depth features from it which we add to an existing feature set to test the effect on spatial relation detection. We show that performance increases are obtained from adding depth features in all scenarios tested.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.