PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition

Wang, Xuwu; Tian, Junfeng; Gui, Min; Li, Zhixu; Ye, Jiabo

doi:10.1007/978-3-031-00129-1_24

Cited by 14 publications

(4 citation statements)

References 15 publications


“…The first group of methods includes BiLSTM-CRF (Huang et al, 2015), BERT-CRF (Devlin et al, 2018) as well as the span-based NER models (e.g., BERT-span, RoBERTa-span (Yamada et al, 2020)), which only consider original text. The second group of methods includes several latest multimodal approaches for MNER task: UMT (Yu et al, 2020), UMGF, MNER-QG (Jia et al, 2022), R-GCN, ITA (Wang et al, 2021a), PromptMNER (Wang et al, 2022b), CAT-MNER (Wang et al, 2022c) and MoRe (Wang et al, 2022a), which consider both text and corresponding images.…”

Section: Results (mentioning)

confidence: 99%

“…The version of ChatGPT used in experiments is gpt-3.5-turbo and sampling temperature is set to 0. For a fair comparison, PGIM chooses to use the same text encoder XLM-RoBERTa large (Conneau et al, 2019) as ITA (Wang et al, 2021a), PromptMNER (Wang et al, 2022b), CAT-MNER (Wang et al, 2022c) and MoRe (Wang et al, 2022a).…”

Section: Stage-2 Entity Prediction Based On Auxiliary Refined Knowledge (mentioning)

confidence: 99%


Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge

Li, Pan et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Multimodal Named Entity Recognition (MNER) on social media aims to enhance textual entity prediction by incorporating image-based clues. Existing studies mainly focus on maximizing the utilization of pertinent image information or incorporating external knowledge from explicit knowledge bases. However, these methods either neglect the necessity of providing the model with external knowledge, or encounter issues of high redundancy in the retrieved knowledge. In this paper, we present PGIM, a two-stage framework that aims to leverage ChatGPT as an implicit knowledge base and enable it to heuristically generate auxiliary knowledge for more efficient entity prediction. Specifically, PGIM contains a Multimodal Similar Example Awareness module that selects suitable examples from a small number of predefined artificial samples. These examples are then integrated into a formatted prompt template tailored to MNER and guide ChatGPT to generate auxiliary refined knowledge. Finally, the acquired knowledge is integrated with the original text and fed into a downstream model for further processing. Extensive experiments show that PGIM outperforms state-of-the-art methods on two classic MNER datasets and exhibits stronger robustness and generalization capability.


“…There are some other approaches that do not directly use the visual information from the images, but they open the new paths to mine the hidden information behind the image. (Wang et al 2022b) designs several prompt templates for each image to bridge the gap…”

Section: Related Work (mentioning)

confidence: 99%

“…Wang et al 2022b; Jia et al 2023), which is used to guide words to get the expanded visual semantic information.…”

Section: Introduction (mentioning)

confidence: 99%

Hierarchical Aligned Multimodal Learning for NER on Tweet Posts

Liu, Li, Ren et al. 2024

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)


Mining structured knowledge from tweets using named entity recognition (NER) can be beneficial for many downstream applications such as recommendation and intention understanding. With tweet posts tending to be multimodal, multimodal named entity recognition (MNER) has attracted more attention. In this paper, we propose a novel approach that dynamically aligns the image and text sequence and achieves multi-level cross-modal learning to augment textual word representation for MNER improvement. To be specific, our framework can be split into three main stages: the first stage focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates visual information of different granularities based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
