2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00103
DocFormer: End-to-End Transformer for Document Understanding

Cited by 122 publications
(78 citation statements)
References 23 publications
“…In recent years, self-supervised pre-training has achieved great success. Inspired by the development of pre-trained language models in various NLP tasks, recent studies on structured document pre-training (Xu et al., 2021a; Li et al., 2021a,b,c; Appalaraju et al., 2021) have pushed the limits. LayoutLM modified the BERT (Devlin et al., 2019) architecture by adding 2D spatial coordinate embeddings.…”
Section: Related Work
confidence: 99%
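The LayoutLM-style input construction quoted above, which sums word embeddings with 2D spatial coordinate embeddings of each token's bounding box, can be sketched as follows. This is a minimal illustration; the table sizes, hidden dimension, and variable names are assumptions for the sketch, not LayoutLM's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; LayoutLM normalizes box
# coordinates to a discrete grid and uses much larger embeddings.
hidden, vocab, grid = 8, 50, 1001

word_emb = rng.normal(size=(vocab, hidden))
x_emb = rng.normal(size=(grid, hidden))  # shared table for x0 and x1
y_emb = rng.normal(size=(grid, hidden))  # shared table for y0 and y1

def layoutlm_input(token_ids, boxes):
    """Sum word embeddings with 2D position embeddings looked up
    from each token's bounding box (x0, y0, x1, y1)."""
    out = word_emb[token_ids]
    for i, table in enumerate([x_emb, y_emb, x_emb, y_emb]):
        out = out + table[boxes[:, i]]
    return out

tokens = np.array([3, 7])
boxes = np.array([[10, 20, 110, 40], [120, 20, 200, 40]])
print(layoutlm_input(tokens, boxes).shape)  # (2, 8)
```

The key point is that layout enters purely additively, so the rest of the BERT architecture is unchanged.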
“…Recently, StrucTexT (Li et al., 2021c) introduced a unified solution to efficiently extract semantic features from different levels and modalities to handle the entity labeling and entity linking tasks. DocFormer (Appalaraju et al., 2021) designed a novel multi-modal self-attention layer capable of fusing textual, visual, and spatial features. Nevertheless, the aforementioned SDU approaches mainly focus on a single language (typically English), which is extremely limiting for multilingual application scenarios.…”
Section: Related Work
confidence: 99%
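The multi-modal self-attention attributed to DocFormer in the excerpt above can be sketched roughly as follows: text and visual streams each attend over the sequence while a shared spatial embedding biases both, so the two modalities see the same layout signal. This is a simplified single-head sketch under assumed names and shapes, not the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(text, visual, spatial):
    """Simplified fusion: each modality runs scaled dot-product
    attention with the shared spatial embedding added to queries
    and keys; the two attended streams are then summed."""
    def attend(feat):
        q = feat + spatial
        k = feat + spatial
        scores = q @ k.T / np.sqrt(feat.shape[-1])
        return softmax(scores) @ feat
    return attend(text) + attend(visual)

n, d = 4, 8
out = multimodal_attention(rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)))
print(out.shape)  # (4, 8)
```

Sharing the spatial term across modalities is what lets layout act as a common reference frame for text and image features.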
“…Most contemporary BERT-like pre-training models for document understanding [9], [10], [16], [17] use individual words as inputs. In a document, however, a single word can often be understood from its local context and does not always require analyzing the entire page.…”
Section: Introduction
confidence: 99%
“…The final sentence embeddings and visual embeddings are obtained by combining textual features and visual features with spatial layout features, respectively. Unlike previous works [9], [10], [16], we design a graph attention network with a gate fusion layer to perform multimodal interaction instead of using the Transformer architecture. We first perform multimodal fusion through the gate fusion layer to fuse the sentence embeddings and visual embeddings.…”
Section: Introduction
confidence: 99%
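The gate fusion described in the excerpt above, which merges sentence and visual embeddings before the graph attention network, might look roughly like this: a sigmoid gate computed from both modalities interpolates elementwise between them. This is a hypothetical sketch; the gate parameterization and all names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Hypothetical gate parameters (randomly initialized for the sketch).
W_g = rng.normal(size=(2 * d, d))
b_g = np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fusion(sent, vis):
    """Gated fusion: a gate in (0, 1), computed from the concatenated
    modalities, mixes the sentence and visual embeddings elementwise."""
    g = sigmoid(np.concatenate([sent, vis], axis=-1) @ W_g + b_g)
    return g * sent + (1.0 - g) * vis

sent = rng.normal(size=(3, d))
vis = rng.normal(size=(3, d))
print(gate_fusion(sent, vis).shape)  # (3, 8)
```

Because the gate lies strictly between 0 and 1, each fused coordinate is a convex combination of the two modality values, which keeps either modality from being discarded entirely.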