Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1078

Multimodal Named Entity Recognition for Short Social Media Posts

Abstract: We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stor…

Cited by 114 publications (93 citation statements).
References 23 publications (43 reference statements).
“…On the other hand, textual attention techniques find semantic or syntactic alignments in handling long-term dependencies. Attention techniques have been extensively employed in vision-and-text tasks, such as Image Captioning [14], VQA [9], Cross-Modal Retrieval [15], and NER [10], [11], [12]. Multimodal Named Entity Recognition.…”
Section: Related Work
confidence: 99%
“…f(·) is a transformation function with stacked layers that projects the weighted input to the KB embedding space, ỹ refers to the embeddings of negative samples randomly sampled from KB entities excluding the ground-truth label of the instance, W = {W_f, W_c, W_w, W_v} are the learnable parameters for f, c, w, and v respectively, and R(W) is a weight-decay regularization term. Similarly to Moon et al. (2018), we formulate the modality attention module for our MNED network as follows, which selectively attenuates or amplifies modalities:…”
Section: Deep Zero-Shot MNED Network (DZMNED)
confidence: 99%
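The modality attention module described above can be sketched in a few lines of numpy: score each modality vector, softmax the scores into weights, and return the weighted sum. This is a minimal illustration under assumed shapes, not the authors' implementation; the names `modality_attention`, `W_a`, and `b_a` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modality_attention(modality_vectors, W_a, b_a):
    """Toy modality attention: one scalar weight per modality.

    modality_vectors: (M, d) array, one row per modality (e.g. char,
    word, visual), all projected to a common dimension d beforehand.
    W_a: (d,) scoring vector; b_a: (M,) per-modality bias. Both are
    assumed learnable parameters in a real model.
    """
    scores = modality_vectors @ W_a + b_a   # (M,) raw attention scores
    alpha = softmax(scores)                 # weights attenuate/amplify modalities
    fused = alpha @ modality_vectors        # (d,) weighted combination
    return fused, alpha

# Usage with three toy modality embeddings of dimension 4:
rng = np.random.default_rng(0)
mods = rng.normal(size=(3, 4))              # char, word, visual
fused, alpha = modality_attention(mods, rng.normal(size=4), np.zeros(3))
```

A modality that is uninformative for a given token (e.g. an irrelevant image) receives a small weight, so its contribution to the fused vector is suppressed rather than concatenated in wholesale.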
“…Multimodal learning studies the learning of a joint model that leverages contextual information from multiple modalities in parallel. Among the multimodal learning tasks relevant to our MNED system is the multimodal named entity recognition task (Moon et al., 2018), which leverages both text and image to classify each token in a sentence as a named entity or not. In their work, they employ an entity LSTM that takes each modality as input, and a softmax layer that outputs an entity label at each decoding step.…”
Section: Related Work
confidence: 99%
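The entity-LSTM-plus-softmax decoding described in this statement can be sketched as a single LSTM step over a (fused) token representation followed by a softmax over entity labels. This is a generic LSTM cell as a hedged illustration of the design, not the paper's code; the gate stacking convention and the names `lstm_step` and `tag_token` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step; gates i, f, o, g stacked in W, U, b."""
    H = h.shape[0]
    z = W @ x + U @ h + b                      # (4H,) pre-activations
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])   # input / forget gates
    o, g = sigmoid(z[2*H:3*H]), np.tanh(z[3*H:])
    c_new = f * c + i * g                      # updated cell state
    h_new = o * np.tanh(c_new)                 # emitted hidden state
    return h_new, c_new

def tag_token(h, W_out, labels):
    """Softmax layer: map the LSTM state to an entity label."""
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return labels[int(np.argmax(p))], p

# Usage: one decoding step for a toy 2-d token input, hidden size 3.
rng = np.random.default_rng(1)
H, D = 3, 2
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)),
                 np.zeros(4*H))
label, p = tag_token(h, rng.normal(size=(2, H)), ["O", "B-PER"])
```

In the actual model the input `x` at each step would be the multimodal (text + image) representation of the current token rather than random noise.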
“…We compared against models based on different multimodal feature representation strategies: Concatenation (CONCAT) [35,11,36]: The model creates a multimodal representation of words by simply concatenating the unimodal features at the word level. Attention-based Weighted Sum (ATTN) [37,16]: Instead of concatenating the unimodal signals as alternative feature vectors, the model uses an attention network to decide how to combine the information for the final representation. Tensor Fusion Network (TFN) [14]: This strategy models both the modality-specific and cross-modal interactions by computing an outer product over a set of unimodal vectors (with an extra constant dimension 1) rather than just the concatenation.…”
Section: Comparison Models
confidence: 99%
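The three fusion strategies compared in this statement differ only in how two unimodal vectors are combined, which is easy to show side by side. Below is a minimal numpy sketch under stated assumptions: both vectors share dimension d for the attention variant, the attention scorer is a toy dot product against a query `w` (a hypothetical stand-in for a learned attention network), and the function names are illustrative.

```python
import numpy as np

def concat_fusion(text_vec, img_vec):
    # CONCAT: word-level concatenation of the unimodal features.
    return np.concatenate([text_vec, img_vec])

def attn_fusion(text_vec, img_vec, w):
    # ATTN: an attention score per modality decides the mixing weights.
    scores = np.array([text_vec @ w, img_vec @ w])
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax over 2 modalities
    return a[0] * text_vec + a[1] * img_vec    # weighted sum, same dim d

def tensor_fusion(text_vec, img_vec):
    # TFN: outer product of each vector appended with a constant 1,
    # so the result spans unimodal AND cross-modal interaction terms.
    t = np.append(text_vec, 1.0)               # (d1 + 1,)
    v = np.append(img_vec, 1.0)                # (d2 + 1,)
    return np.outer(t, v).ravel()              # ((d1+1)*(d2+1),)

# Usage with two toy 3-d modality vectors:
rng = np.random.default_rng(2)
t_vec, i_vec = rng.normal(size=3), rng.normal(size=3)
cat = concat_fusion(t_vec, i_vec)      # dimension 6
att = attn_fusion(t_vec, i_vec, rng.normal(size=3))  # dimension 3
tfn = tensor_fusion(t_vec, i_vec)      # dimension 16
```

The appended constant 1 in TFN is what preserves the unimodal features inside the outer product: the last row and column of the product reproduce each input vector on its own.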