2019
DOI: 10.1007/978-3-030-11018-5_12

Distinctive-Attribute Extraction for Image Captioning

Abstract: Image captioning, an open research issue, has evolved with the progress of deep neural networks. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are employed to compute image features and to generate natural-language descriptions, respectively. In previous works, a caption involving semantic description can be generated by feeding additional information into the RNNs. Following this approach, we propose a distinctive-attribute extraction (DaE) method which explicitly encourages significant mea…
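For context, the CNN-plus-RNN pipeline the abstract refers to is an encoder-decoder: a CNN produces an image feature that conditions an LSTM, which then emits the caption word by word. The sketch below is a generic illustration of that pipeline, not the paper's DaE model; the backbone (ResNet-18), the layer sizes, and feeding the image feature as the decoder's first input are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """Minimal CNN-encoder / RNN-decoder captioner, in the spirit of the
    abstract. All architectural choices here are illustrative assumptions."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)                 # image feature extractor
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop final fc
        self.img_proj = nn.Linear(512, embed_dim)           # CNN feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)              # (B, 512)
        feat = self.img_proj(feat).unsqueeze(1)             # (B, 1, E)
        words = self.embed(captions)                        # (B, T, E)
        # Prepend the image feature as the first "token" (show-and-tell style,
        # assumed here) so the LSTM is conditioned on the image.
        seq = torch.cat([feat, words], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                             # per-step word logits
```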

Cited by 4 publications (2 citation statements). References 24 publications (71 reference statements).
“…where $f_w$ is a function that calculates the weight value allocated to $w_i$, $x_i$ is the embedding vector of $w_i$, and $s_t$ represents the word context vector at time $t$. Note that $\delta_{tk}$ stays the same during the generation of each word until the last time step. Here, inspired by the previous works [16], [9], we use the TF-IDF method as the function $f_w$, as this method can measure the importance degree of each word in a sentence or document. The word context vector $s_t$ is then fused with the previous hidden state $h_{t-1}$ of the LSTM decoder to combine more compact semantic information to guide the visual attention, calculated as follows:…”
Section: ConceptNet
confidence: 99%
“…where $f_w$ is a function that calculates the weight value allocated to $w_i$, $x_i$ is the embedding vector of $w_i$, and $s_t$ represents the word context vector at time $t$. Note that $\delta_{tk}$ stays the same during the generation of each word until the last time step. Here, inspired by the previous works (Kim et al. 2018; Park et al. 2017), we use the TF-IDF method as the function $f_w$, as this method can measure the importance degree of each word in a sentence or document. The word context vector $s_t$ is then fused with the previous hidden state $h_{t-1}$ of the LSTM decoder to combine more compact semantic information to guide the visual attention, calculated as follows:…”
Section: Implementation of Word Attention
confidence: 99%
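Both citation statements quote the same mechanism: a TF-IDF score $f_w$ weights each word, the weighted word embeddings are combined into the word context vector $s_t$, and $s_t$ is fused with the decoder's previous hidden state $h_{t-1}$. Below is a minimal NumPy sketch of that computation under stated assumptions: the weighted-sum form $s_t = \sum_i f_w(w_i)\,x_i$ and the concatenate-then-project fusion are plausible readings of the quotes, not the cited papers' exact equations, and all helper names are hypothetical.

```python
import numpy as np

def tfidf_weights(words, doc_freq, n_docs):
    """TF-IDF score per word of one sentence (the f_w in the quotes).

    doc_freq: mapping word -> number of training captions containing it.
    """
    tf = {w: words.count(w) / len(words) for w in set(words)}
    return np.array([
        tf[w] * np.log(n_docs / (1.0 + doc_freq.get(w, 0)))
        for w in words
    ])

def word_context_vector(words, embeddings, doc_freq, n_docs):
    """s_t = sum_i f_w(w_i) * x_i : TF-IDF-weighted sum of word embeddings."""
    weights = tfidf_weights(words, doc_freq, n_docs)   # (n_words,)
    x = np.stack([embeddings[w] for w in words])       # (n_words, dim)
    return weights @ x                                 # (dim,)

def fuse_with_hidden(s_t, h_prev, W):
    """Fuse s_t with the previous LSTM hidden state h_{t-1}.

    The quotes only say the two are 'fused'; a linear projection of their
    concatenation is assumed here for illustration.
    """
    return np.tanh(W @ np.concatenate([s_t, h_prev]))

# Toy usage: common words ("a") get low TF-IDF weight, rare content words
# ("bus") dominate the context vector -- the reason the quotes pick TF-IDF.
emb = {w: np.random.randn(8) for w in ("a", "red", "bus")}
s_t = word_context_vector(["a", "red", "bus"], emb,
                          {"a": 900, "red": 40, "bus": 25}, n_docs=1000)
```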