Aiding Intra-Text Representations with Visual Context for Multimodal Named Entity Recognition

Arshad, Omer; Gallo, Ignazio; Nawaz, Shah; Calefati, Alessandro

doi:10.1109/icdar.2019.00061

Cited by 37 publications

(32 citation statements)

References 15 publications

(25 reference statements)

“…The output score of the filtration gate was computed by a sigmoid activation function. Arshad et al (2019) also presented a gated multimodal fusion representation for each token. The gated fusion is a weighted sum of visual attention feature and token alignment feature.…”

Section: Related Work (mentioning)

confidence: 99%

“…We illustrate four failure examples mentioned in (Lu et al, 2018) and (Arshad et al, 2019) in Table 7.…”

Section: Case Study (mentioning)

confidence: 99%

“…One of the reasons is that tweets are short messages and the context for inference is insufficient. Recent works on tweets based on multimodal learning have been increasing (Moon et al, 2018; Lu et al, 2018; Arshad et al, 2019). The researchers attempted to improve the performance of NER in tweets with the aid of visual clues.…”

Section: Introduction (mentioning)

confidence: 99%

“…Most of the multimodal NER (MNER) methods used attention weights to extract visual clues related to the NEs (Lu et al, 2018; Arshad et al, 2019). The visual attention-based models always assume that the images in tweets are related to the texts, such as words in the text are represented in the image, e.g., Figure 1(a) shows a successful visual attention example from Lu et al (2018).…”

Section: Introduction (mentioning)

confidence: 99%

RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER

Sun, Wang, Su et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

Multimodal named entity recognition (MNER) for tweets has received increasing attention recently. Most of the multimodal methods used attention mechanisms to capture the text-related visual information. However, unrelated or weakly related text-image pairs account for a large proportion in tweets. Visual clues unrelated to the text would incur uncertain or even negative effects for multimodal model learning. In this paper, we propose a novel pre-trained multimodal model based on Relationship Inference and Visual Attention (RIVA) for tweets. The RIVA model controls the attention-based visual clues with a gate regarding the role of image to the semantics of text. We use a teacher-student semi-supervised paradigm to leverage a large unlabeled multimodal tweet corpus with a labeled data set for text-image relation classification. In the multimodal NER task, the experimental results show the significance of text-related visual features for the visual-linguistic model and our approach achieves SOTA performance on the MNER datasets.

“…Typically, users combine text, image, audio or video to sell a product over an e-commerce platform or express views on social media. The combination of these media types has been extensively studied to solve various tasks including classification [1], [2], [3], cross-modal retrieval [4], semantic relatedness [5], [6], image captioning [7], [8], multimodal named entity recognition [9], [10] and Visual Question Answering [11], [12]. In addition, multimodal data fueled an increased interest in generating images conditioned on natural language [13], [14].…”

Section: Introduction (mentioning)

confidence: 99%

Picture What You Read

Gallo, Nawaz, Calefati et al. 2019

2019 Digital Image Computing: Techniques and Applications (DICTA)

Self Cite

Visualization refers to our ability to create an image in our head based on the text we read or the words we hear. It is one of the many skills that make reading comprehension possible. Convolutional Neural Networks (CNNs) are an excellent tool for recognizing and classifying text documents. In addition, they can generate images conditioned on natural language. In this work, we utilize CNNs' capabilities to generate realistic images representative of the text illustrating the semantic concept. We conducted various experiments to highlight the capacity of the proposed model to generate representative images of the text descriptions used as input to the proposed model.

Multimodal Aspect Extraction with Region-Aware Alignment Network

Cheng, Wang et al. 2020

Lecture Notes in Computer Science
