2019 International Conference on Document Analysis and Recognition (ICDAR)
DOI: 10.1109/icdar.2019.00061
Aiding Intra-Text Representations with Visual Context for Multimodal Named Entity Recognition

Abstract: With the massive explosion of social media such as Twitter and Instagram, people share billions of multimedia posts daily, containing images and text. Typically, the text in these posts is short, informal and noisy, leading to ambiguities which can be resolved using the images. In this paper we explore the text-centric Named Entity Recognition task on these multimedia posts. We propose an end-to-end model which learns a joint representation of a text and an image. Our model extends the multi-dimensional self-attention technique, …
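The multi-dimensional self-attention the abstract refers to can be illustrated with a minimal sketch. Unlike standard self-attention, which assigns one scalar weight per token, the multi-dimensional variant assigns a separate attention weight to each feature dimension of each token. The scoring function below, which simply reuses the raw feature value as the alignment score, is a placeholder assumption for illustration; the paper's model learns these scores with a feed-forward network over learned token representations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_dim_self_attention(tokens):
    """Feature-wise (multi-dimensional) self-attention.

    tokens: list of d-dimensional feature vectors, one per token.
    Returns a single d-dimensional vector in which each dimension is
    attended over independently across tokens, rather than sharing one
    scalar weight per token.
    """
    d = len(tokens[0])
    pooled = []
    for k in range(d):
        column = [t[k] for t in tokens]      # k-th feature of every token
        weights = softmax(column)            # per-dimension attention weights
        pooled.append(sum(w * c for w, c in zip(weights, column)))
    return pooled
```

Because the softmax runs per dimension, a token can dominate one feature dimension while contributing little to another, which is the point of the multi-dimensional formulation.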

Cited by 37 publications (32 citation statements) · References 15 publications (25 reference statements)
“…The output score of the filtration gate was computed by a sigmoid activation function. Arshad et al (2019) also presented a gated multimodal fusion representation for each token. The gated fusion is a weighted sum of visual attention feature and token alignment feature.…”
Section: Related Work
confidence: 99%
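The gated fusion described in this citation can be sketched as follows: a sigmoid gate is computed from both modalities, and the fused token representation is the gate-weighted sum of the visual attention feature and the token alignment (text) feature. The scalar weights `w_t`, `w_v`, and bias `b` here are illustrative stand-ins; the actual model learns full weight matrices over the feature vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_feat, visual_feat, w_t, w_v, b):
    """Per-dimension gated multimodal fusion.

    For each dimension, a gate g in (0, 1) is computed from a linear
    combination of the text and visual features; the fused value is
    g * visual + (1 - g) * text, i.e. a weighted sum of the two modalities.
    """
    fused = []
    for t, v, wt, wv in zip(text_feat, visual_feat, w_t, w_v):
        g = sigmoid(wt * t + wv * v + b)
        fused.append(g * v + (1.0 - g) * t)
    return fused
```

Because the gate is a sigmoid, each fused value stays between the corresponding text and visual features, letting the model suppress the image signal for tokens where it is uninformative.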
“…We illustrate four failure examples mentioned in (Lu et al, 2018) and (Arshad et al, 2019) in Table 7.…”
Section: Case Study
confidence: 99%
“…Typically, users combine text, image, audio or video to sell a product over an e-commerce platform or express views on social media. The combination of these media types has been extensively studied to solve various tasks including classification [1], [2], [3], cross-modal retrieval [4], semantic relatedness [5], [6], image captioning [7], [8], multimodal named entity recognition [9], [10] and Visual Question Answering [11], [12]. In addition, multimodal data fueled an increased interest in generating images conditioned on natural language [13], [14].…”
Section: Introduction
confidence: 99%