Automatically generating a human-like description for a given image is a potential research in artificial intelligence, which has attracted a great of attention recently. Most of the existing attention methods explore the mapping relationships between words in sentence and regions in image, such unpredictable matching manner sometimes causes inharmonious alignments that may reduce the quality of generated captions. In this paper, we make our efforts to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word-byword. The special word attention emphasizes on word importance when focusing on different regions of the input image, and makes full use of the internal annotation knowledge to assist the calculation of visual attention. Then, in order to reveal those incomprehensible intentions that cannot be expressed straightforwardly by machines, we inject external knowledge extracted from knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. We validate our model on two freely available captioning benchmarks: Microsoft COCO dataset and Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many of the existing approaches. CCS CONCEPTS • Computing methodologies → Machine learning; Scene understanding; Natural language generation.
Automatically generating a human-like description for a given image is a potential research in artificial intelligence, which has attracted a great of attention recently. Most of the existing attention methods explore the mapping relationships between words in sentence and regions in image, such unpredictable matching manner sometimes causes inharmonious alignments that may reduce the quality of generated captions. In this paper, we make our efforts to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word. The special word attention emphasizes on word importance when focusing on different regions of the input image, and makes full use of the internal annotation knowledge to assist the calculation of visual attention. Then, in order to reveal those incomprehensible intentions that cannot be expressed straightforwardly by machines, we introduce a new strategy to inject external knowledge extracted from knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. Finally, we validate our model on two freely available captioning benchmarks: Microsoft COCO dataset and Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many of the existing approaches.
Most existing image captioning methods use only the visual information of the image to guide the generation of captions, lack the guidance of effective scene semantic information, and the current visual attention mechanism cannot adjust the focus intensity on the image. In this article, we first propose an improved visual attention model. At each timestep, we calculated the focus intensity coefficient of the attention mechanism through the context information of the model, then automatically adjusted the focus intensity of the attention mechanism through the coefficient to extract more accurate visual information. In addition, we represented the scene semantic knowledge of the image through topic words related to the image scene, then added them to the language model. We used the attention mechanism to determine the visual information and scene semantic information that the model pays attention to at each timestep and combined them to enable the model to generate more accurate and scene-specific captions. Finally, we evaluated our model on Microsoft COCO (MSCOCO) and Flickr30k standard datasets. The experimental results show that our approach generates more accurate captions and outperforms many recent advanced models in various evaluation metrics.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.