2023
DOI: 10.1007/978-3-031-26438-2_35

Unimodal and Multimodal Representation Training for Relation Extraction

Abstract: Multimodal integration of text, layout and visual information has achieved SOTA results in visually rich document understanding (VrDU) tasks, including relation extraction (RE). However, despite its importance, evaluation of the relative predictive capacity of these modalities is less prevalent. Here, we demonstrate the value of shared representations for RE tasks by conducting experiments in which each data type is iteratively excluded during training. In addition, text and layout data are evaluated in isolat…
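The ablation the abstract describes, iteratively excluding one modality during training, can be sketched roughly as follows. The model, feature widths, and field names below are hypothetical stand-ins for illustration only, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical per-modality feature widths; not taken from the paper.
DIMS = {"text": 768, "layout": 128, "visual": 256}
MODALITIES = list(DIMS)

class FusionRE(nn.Module):
    """Toy fusion head: concatenate per-modality features and classify the relation."""
    def __init__(self, n_labels=2):
        super().__init__()
        self.classifier = nn.Linear(sum(DIMS.values()), n_labels)

    def forward(self, feats, drop=None):
        parts = []
        for name in MODALITIES:
            x = feats[name]
            if name == drop:                 # excluded modality contributes nothing
                x = torch.zeros_like(x)
            parts.append(x)
        return self.classifier(torch.cat(parts, dim=-1))

# One run per excluded modality, plus the full multimodal baseline (drop=None).
batch = {name: torch.randn(4, dim) for name, dim in DIMS.items()}
labels = torch.randint(0, 2, (4,))
for drop in [None] + MODALITIES:
    model = FusionRE()
    loss = nn.functional.cross_entropy(model(batch, drop=drop), labels)
    print(f"excluded={drop} loss={loss.item():.3f}")
```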

Cited by 3 publications (4 citation statements)
References 32 publications
“…Based on the input data source there are two notable paths in representation learning literature which are unimodal and multimodal approaches [11]. As discussed in Section 2.3, we draw inspiration from the recent successes of VA RecSys employing NN-based representation learning, and observing how the resulting personalised recommendations capture hidden semantics and benefit users in many ways such as learning, discovery, enhanced engagement and better interaction experience.…”
Section: Background: Learning Latent Representations Of Visual Art
confidence: 99%
“…We used the pre-trained transformer-based sentiment analysis model bert-large-uncased-sst2 from the Hugging Face Transformers library (https://huggingface.co/models). This model is a fine-tuned version of bert-large-uncased which was trained on the Stanford Sentiment Treebank v2 (SST2), part of the General Language Understanding Evaluation (GLUE) benchmark. It is well-suited for a wide range of NLP tasks due to its large size and general language understanding capabilities.…”
Section: Comparison Of Recommended Groups
confidence: 99%
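The sentiment-analysis setup quoted above maps onto a short Hugging Face pipeline call. The sketch below assumes a Hub repository id for the bert-large-uncased-sst2 checkpoint, since the statement names only the model, not its exact identifier; substitute whichever SST-2 fine-tuned checkpoint was actually used.

```python
# Minimal sketch of loading an SST-2 fine-tuned BERT sentiment classifier with the
# Hugging Face pipeline API. The repository id is an assumption; the quote names
# only "bert-large-uncased-sst2".
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="assemblyai/bert-large-uncased-sst2",  # assumed Hub id for the checkpoint
)

print(sentiment(["The recommended group was surprisingly engaging.",
                 "The interface felt confusing and slow."]))
# Output: a list of {'label': ..., 'score': ...} dicts, one per input sentence.
```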
“…Unimodal misinformation involves the use of a single modality, such as text or image. On the other hand, multimodal misinformation relies on the combination of multiple modalities, such as image-caption pairs [58], [59].…”
Section: Related Work
confidence: 99%
“…State-of-the-art LMMs integrate advanced computer vision models (Ren et al, 2015;He et al, 2016) within BERT-like architectures (Devlin et al, 2019) to leverage spatial and visual information along with text and learn multimodal fused representations for form-like documents. However, these representations are biased toward textual and spatial modalities (Cooney et al, 2023) and have limited performance, especially when the data contains richer visual information. This is because the visual encoder in these models usually plays a secondary role compared to advanced text encoders.…”
Section: Introduction
confidence: 99%
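As a rough illustration of the fused-representation idea described in the statement above (and not any cited model's exact architecture), the sketch below projects pooled CNN features into a transformer encoder's embedding space and prepends them to the text embeddings as an extra token, so the encoder attends over both modalities jointly.

```python
# Illustrative sketch only: a stand-in visual backbone feeds one extra "visual token"
# into a small transformer encoder alongside pre-computed text embeddings.
import torch
import torch.nn as nn
import torchvision.models as tv

class VisualTokenFusion(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        backbone = tv.resnet18(weights=None)                          # stand-in visual encoder
        self.visual = nn.Sequential(*list(backbone.children())[:-1])  # globally pooled features
        self.proj = nn.Linear(512, hidden)                            # map 512-d feats to text width
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_embeds, images):
        # text_embeds: (batch, seq_len, hidden); images: (batch, 3, H, W)
        vis = self.visual(images).flatten(1)           # (batch, 512)
        vis_tok = self.proj(vis).unsqueeze(1)          # (batch, 1, hidden)
        fused = torch.cat([vis_tok, text_embeds], 1)   # prepend the visual token
        return self.encoder(fused)

model = VisualTokenFusion()
out = model(torch.randn(2, 16, 768), torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 17, 768])
```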