2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00103
DocFormer: End-to-End Transformer for Document Understanding

Cited by 122 publications
(78 citation statements)
References 23 publications
“…In recent years, self-supervised pre-training has achieved great success. Inspired by the development of pre-trained language models in various NLP tasks, recent studies on structured document pre-training (Xu et al., 2021a; Li et al., 2021a,b,c; Appalaraju et al., 2021) have pushed the limits. LayoutLM modified the BERT (Devlin et al., 2019) architecture by adding 2D spatial coordinate embeddings.…”
Section: Related Work
confidence: 99%
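The LayoutLM-style input construction quoted above, which sums word embeddings with 2D spatial coordinate embeddings of each token's bounding box, can be sketched as follows. This is a minimal illustration; the table sizes, hidden dimension, and variable names are assumptions for the sketch, not LayoutLM's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; LayoutLM normalizes box
# coordinates to a discrete grid and uses much larger embeddings.
hidden, vocab, grid = 8, 50, 1001

word_emb = rng.normal(size=(vocab, hidden))
x_emb = rng.normal(size=(grid, hidden))  # shared table for x0 and x1
y_emb = rng.normal(size=(grid, hidden))  # shared table for y0 and y1

def layoutlm_input(token_ids, boxes):
    """Sum word embeddings with 2D position embeddings looked up
    from each token's bounding box (x0, y0, x1, y1)."""
    out = word_emb[token_ids]
    for i, table in enumerate([x_emb, y_emb, x_emb, y_emb]):
        out = out + table[boxes[:, i]]
    return out

tokens = np.array([3, 7])
boxes = np.array([[10, 20, 110, 40], [120, 20, 200, 40]])
print(layoutlm_input(tokens, boxes).shape)  # (2, 8)
```

The key point is that layout enters purely additively, so the rest of the BERT architecture is unchanged.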
“…Recently, StrucTexT (Li et al., 2021c) introduced a unified solution to efficiently extract semantic features from different levels and modalities to handle the entity labeling and entity linking tasks. DocFormer (Appalaraju et al., 2021) designed a novel multi-modal self-attention layer capable of fusing textual, visual, and spatial features. Nevertheless, the aforementioned SDU approaches mainly focus on a single language (typically English), which is extremely limiting for multilingual application scenarios.…”
Section: Related Work
confidence: 99%
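The multi-modal self-attention attributed to DocFormer in the excerpt above can be sketched roughly as follows: text and visual streams each attend over the sequence while a shared spatial embedding biases both, so the two modalities see the same layout signal. This is a simplified single-head sketch under assumed names and shapes, not the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(text, visual, spatial):
    """Simplified fusion: each modality runs scaled dot-product
    attention with the shared spatial embedding added to queries
    and keys; the two attended streams are then summed."""
    def attend(feat):
        q = feat + spatial
        k = feat + spatial
        scores = q @ k.T / np.sqrt(feat.shape[-1])
        return softmax(scores) @ feat
    return attend(text) + attend(visual)

n, d = 4, 8
out = multimodal_attention(rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)))
print(out.shape)  # (4, 8)
```

Sharing the spatial term across modalities is what lets layout act as a common reference frame for text and image features.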
“…Most contemporary BERT-like pre-training models for document understanding [9], [10], [16], [17] use individual words as inputs. In a document, however, a single word can often be understood from its local context and does not always require analyzing the entire page.…”
Section: Introduction
confidence: 99%
“…The final sentence embeddings and visual embeddings are obtained by combining textual features and visual features with spatial layout features, respectively. Unlike previous works [9], [10], [16], we design a graph attention network with a gate fusion layer to perform multimodal interaction instead of using the Transformer architecture. We first perform multimodal fusion through the gate fusion layer to fuse the sentence embeddings and visual embeddings.…”
Section: Introduction
confidence: 99%
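The gate fusion described in the excerpt above, which merges sentence and visual embeddings before the graph attention network, might look roughly like this: a sigmoid gate computed from both modalities interpolates elementwise between them. This is a hypothetical sketch; the gate parameterization and all names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Hypothetical gate parameters (randomly initialized for the sketch).
W_g = rng.normal(size=(2 * d, d))
b_g = np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fusion(sent, vis):
    """Gated fusion: a gate in (0, 1), computed from the concatenated
    modalities, mixes the sentence and visual embeddings elementwise."""
    g = sigmoid(np.concatenate([sent, vis], axis=-1) @ W_g + b_g)
    return g * sent + (1.0 - g) * vis

sent = rng.normal(size=(3, d))
vis = rng.normal(size=(3, d))
print(gate_fusion(sent, vis).shape)  # (3, 8)
```

Because the gate lies strictly between 0 and 1, each fused coordinate is a convex combination of the two modality values, which keeps either modality from being discarded entirely.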