Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.201

LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

Abstract: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also new text-image alignment and text-image matching tasks, which better capture the cross-modality interaction in the pre-training stage.
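For orientation, here is a minimal sketch of running the pre-trained encoder via the Hugging Face transformers implementation. It assumes transformers is installed along with detectron2 and pytesseract (which the processor's built-in OCR relies on) and uses the public microsoft/layoutlmv2-base-uncased checkpoint; the processor constructs the text, layout (bounding-box), and image inputs that the multi-modal encoder fuses.

```python
# Minimal sketch: encode a scanned page with LayoutLMv2 via Hugging Face
# transformers. Assumes `transformers`, `detectron2`, and `pytesseract`
# are installed; the processor runs OCR and builds text + bbox + image inputs.
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")  # any scanned/digital-born page
encoding = processor(image, return_tensors="pt")   # OCR -> tokens, boxes, image tensor

outputs = model(**encoding)
# One contextualized vector per text token plus 7x7 = 49 visual patch tokens.
print(outputs.last_hidden_state.shape)
```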

Cited by 193 publications (237 citation statements). References 30 publications.
“…LayoutLM (Xu et al., 2020, 2021) incorporates multimodal self-supervised learning to utilize deep learning for form understanding. While it may alleviate the need for a large training dataset, it is not trivial to adopt the same method for logical structure analysis, as text blocks would not fit into LayoutLM's context.…”
Section: Related Work (mentioning)
Confidence: 99%
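The context limit noted in this citation is concrete: LayoutLM-style encoders accept at most 512 subword tokens, so long documents must be windowed. Below is a minimal sketch of one common workaround, overlapping sliding windows over the OCR word sequence; the window and stride sizes and the merging strategy are illustrative assumptions, not something prescribed by the cited papers.

```python
# Minimal sketch: split a long document's OCR words (with their boxes)
# into overlapping windows that each fit a 512-token encoder context.
# Word-level budgeting for simplicity; in practice one budgets subword tokens.
def sliding_windows(words, boxes, max_len=512, overlap=128):
    """Yield overlapping (words, boxes) chunks of at most `max_len` items."""
    step = max_len - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield words[start:start + max_len], boxes[start:start + max_len]

# Each window is encoded and classified independently; overlapping token
# predictions can then be merged, e.g. by preferring the window in which
# the token lies farthest from a window boundary.
```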
“…{ "items": [ { "name": "3002-Kyoto Choco Mochi", "count": 2, "priceInfo": { "unitPrice": 14000, "price": 28000 } }, { "name": "1001 -Choco Bun", "count": 1, "priceInfo": { "unitPrice": 22000 "price": 22000 } }, ... ], "total": [ { "menuqty_cnt": 4, "total_price": 50000 } ] } { "words": [ { "id": 1, "bbox": [[360,2048],..., [355,2127]], "text": "3002-Kyoto" }, { "id": 2, "bbox": [[801,2074],..., [801,2139]], "text": "Choco" }, { "id": 3, "bbox": [[1035,2074],..., [1035,2147]], "text": "Mochi" }, { "id": 4, "bbox": [[761,2172],..., [761,2253]], "text": "14.000" }, …, { "id": 22, "bbox": [[1573,3030],..., [1571,3126]], "text": "50.000" } ] } text information as input and perform their own objectives with the OCR-extracted texts. (Katti et al, 2018;Hwang et al, 2019Hwang et al, , 2020Hwang et al, , 2021aSage et al, 2020;Majumder et al, 2020a;Xu et al, 2019Xu et al, , 2021. For example, (Hwang et al, 2019), a currently-deployed document parsing system for business card and receipt images, consists of three separate modules for text detection, text recognition, and parsing (See Figure 2).…”
Section: Document Imagementioning
confidence: 99%
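To make the parsing stage of such a pipeline concrete, here is a small, self-contained sketch that groups an OCR word list like the one in the quoted figure into visual lines before field extraction. The y-tolerance heuristic and the simplified (x, y) top-left boxes are illustrative assumptions, not the cited system's algorithm.

```python
def group_into_lines(words, y_tol=40):
    """Group OCR words into visual lines by the y of their top-left corner."""
    lines = []
    for w in sorted(words, key=lambda w: w["bbox"][1]):
        # Same line if the y-coordinate is close to the line's first word.
        if lines and abs(lines[-1][0]["bbox"][1] - w["bbox"][1]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Order words left-to-right within each line.
    return [sorted(line, key=lambda w: w["bbox"][0]) for line in lines]

# Words adapted from the figure (quad boxes simplified to top-left points).
words = [
    {"bbox": (360, 2048), "text": "3002-Kyoto"},
    {"bbox": (801, 2074), "text": "Choco"},
    {"bbox": (1035, 2074), "text": "Mochi"},
    {"bbox": (761, 2172), "text": "14.000"},
]
for line in group_into_lines(words):
    print(" ".join(w["text"] for w in line))
# prints: "3002-Kyoto Choco Mochi" then "14.000"
```

A real parser would then map such lines to fields (item name, count, unit price, total), which is where layout-aware models like LayoutLM replace hand-written rules.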
“…Earlier attempts in VDU were made with vision-based approaches (Kang et al., 2014; Afzal et al., 2015; Harley et al., 2015a), which showed the importance of textual understanding in VDU (Xu et al., 2019). With the emergence of BERT (Devlin et al., 2018), most state-of-the-art methods (Xu et al., 2019, 2021; Hong et al., 2021) have combined computer vision (CV) and natural language processing (NLP) techniques and shown remarkable advances in recent years.…”
Section: Preliminary: Background (mentioning)
Confidence: 99%