Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1078

Multimodal Named Entity Recognition for Short Social Media Posts

Abstract: We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stor…

Cited by 114 publications (93 citation statements).
References 23 publications (43 reference statements).
“…On the other hand, textual attention techniques find semantic or syntactic alignments in handling long-term dependencies. Attention techniques have been extensively employed in vision-and-text tasks, such as Image Captioning [14], VQA [9], Cross-Modal Retrieval [15], and NER [10], [11], [12]. Multimodal Named Entity Recognition.…”
Section: Related Work
confidence: 99%
“…f(·) is a transformation function with stacked layers that projects the weighted input to the KB embedding space, ỹ refers to the embeddings of negative samples randomly sampled from KB entities excluding the ground-truth label of the instance, W = {W_f, W_c, W_w, W_v} are the learnable parameters for f, c, w, and v respectively, and R(W) is a weight-decay regularization term. Similarly to Moon et al. (2018), we formulate the modality attention module for our MNED network as follows, which selectively attenuates or amplifies modalities:…”
Section: Deep Zero-Shot MNED Network (DZMNED)
confidence: 99%
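The modality attention module described above can be sketched in a few lines of numpy: score each modality vector, softmax the scores into weights, and return the weighted sum. This is a minimal illustration under assumed shapes, not the authors' implementation; the names `modality_attention`, `W_a`, and `b_a` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modality_attention(modality_vectors, W_a, b_a):
    """Toy modality attention: one scalar weight per modality.

    modality_vectors: (M, d) array, one row per modality (e.g. char,
    word, visual), all projected to a common dimension d beforehand.
    W_a: (d,) scoring vector; b_a: (M,) per-modality bias. Both are
    assumed learnable parameters in a real model.
    """
    scores = modality_vectors @ W_a + b_a   # (M,) raw attention scores
    alpha = softmax(scores)                 # weights attenuate/amplify modalities
    fused = alpha @ modality_vectors        # (d,) weighted combination
    return fused, alpha

# Usage with three toy modality embeddings of dimension 4:
rng = np.random.default_rng(0)
mods = rng.normal(size=(3, 4))              # char, word, visual
fused, alpha = modality_attention(mods, rng.normal(size=4), np.zeros(3))
```

A modality that is uninformative for a given token (e.g. an irrelevant image) receives a small weight, so its contribution to the fused vector is suppressed rather than concatenated in wholesale.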
“…Multimodal learning studies the learning of a joint model that leverages contextual information from multiple modalities in parallel. Among the multimodal learning tasks relevant to our MNED system is the multimodal named entity recognition task (Moon et al., 2018), which leverages both text and image to classify each token in a sentence as a named entity or not. In their work, they employ an entity LSTM that takes each modality as input, and a softmax layer that outputs an entity label at each decoding step.…”
Section: Related Work
confidence: 99%
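The entity-LSTM-plus-softmax decoding described in this statement can be sketched as a single LSTM step over a (fused) token representation followed by a softmax over entity labels. This is a generic LSTM cell as a hedged illustration of the design, not the paper's code; the gate stacking convention and the names `lstm_step` and `tag_token` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step; gates i, f, o, g stacked in W, U, b."""
    H = h.shape[0]
    z = W @ x + U @ h + b                      # (4H,) pre-activations
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])   # input / forget gates
    o, g = sigmoid(z[2*H:3*H]), np.tanh(z[3*H:])
    c_new = f * c + i * g                      # updated cell state
    h_new = o * np.tanh(c_new)                 # emitted hidden state
    return h_new, c_new

def tag_token(h, W_out, labels):
    """Softmax layer: map the LSTM state to an entity label."""
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return labels[int(np.argmax(p))], p

# Usage: one decoding step for a toy 2-d token input, hidden size 3.
rng = np.random.default_rng(1)
H, D = 3, 2
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)),
                 np.zeros(4*H))
label, p = tag_token(h, rng.normal(size=(2, H)), ["O", "B-PER"])
```

In the actual model the input `x` at each step would be the multimodal (text + image) representation of the current token rather than random noise.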
“…We compared against models based on different multimodal feature representation strategies: Concatenation (CONCAT) [35,11,36]: The model creates a multimodal representation of words by simply concatenating the unimodal features at the word level. Attention-based Weighted Sum (ATTN) [37,16]: Instead of concatenating the unimodal signals as alternative feature vectors, the model uses an attention network to decide how to combine the information for the final representation. Tensor Fusion Network (TFN) [14]: This strategy models both the modality-specific and cross-modal interactions by computing an outer product over a set of unimodal vectors (with an extra constant dimension 1) rather than just the concatenation.…”
Section: Comparison Models
confidence: 99%
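The three fusion strategies compared in this statement differ only in how two unimodal vectors are combined, which is easy to show side by side. Below is a minimal numpy sketch under stated assumptions: both vectors share dimension d for the attention variant, the attention scorer is a toy dot product against a query `w` (a hypothetical stand-in for a learned attention network), and the function names are illustrative.

```python
import numpy as np

def concat_fusion(text_vec, img_vec):
    # CONCAT: word-level concatenation of the unimodal features.
    return np.concatenate([text_vec, img_vec])

def attn_fusion(text_vec, img_vec, w):
    # ATTN: an attention score per modality decides the mixing weights.
    scores = np.array([text_vec @ w, img_vec @ w])
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax over 2 modalities
    return a[0] * text_vec + a[1] * img_vec    # weighted sum, same dim d

def tensor_fusion(text_vec, img_vec):
    # TFN: outer product of each vector appended with a constant 1,
    # so the result spans unimodal AND cross-modal interaction terms.
    t = np.append(text_vec, 1.0)               # (d1 + 1,)
    v = np.append(img_vec, 1.0)                # (d2 + 1,)
    return np.outer(t, v).ravel()              # ((d1+1)*(d2+1),)

# Usage with two toy 3-d modality vectors:
rng = np.random.default_rng(2)
t_vec, i_vec = rng.normal(size=3), rng.normal(size=3)
cat = concat_fusion(t_vec, i_vec)      # dimension 6
att = attn_fusion(t_vec, i_vec, rng.normal(size=3))  # dimension 3
tfn = tensor_fusion(t_vec, i_vec)      # dimension 16
```

The appended constant 1 in TFN is what preserves the unimodal features inside the outer product: the last row and column of the product reproduce each input vector on its own.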