Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.201

LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

Abstract: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also new text-image alignment and text-image matching tasks, which better capture the cross-modality interaction in the pre-training stage.
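For orientation, here is a minimal sketch of running the pre-trained encoder via the Hugging Face transformers implementation. It assumes transformers is installed along with detectron2 and pytesseract (which the processor's built-in OCR relies on) and uses the public microsoft/layoutlmv2-base-uncased checkpoint; the processor constructs the text, layout (bounding-box), and image inputs that the multi-modal encoder fuses.

```python
# Minimal sketch: encode a scanned page with LayoutLMv2 via Hugging Face
# transformers. Assumes `transformers`, `detectron2`, and `pytesseract`
# are installed; the processor runs OCR and builds text + bbox + image inputs.
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")  # any scanned/digital-born page
encoding = processor(image, return_tensors="pt")   # OCR -> tokens, boxes, image tensor

outputs = model(**encoding)
# One contextualized vector per text token plus 7x7 = 49 visual patch tokens.
print(outputs.last_hidden_state.shape)
```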

Cited by 193 publications (237 citation statements). References 30 publications.
“…LayoutLM (Xu et al., 2020, 2021) incorporates multimodal self-supervised learning to utilize deep learning for form understanding. While it may alleviate the need for a large training dataset, it is not trivial to adopt the same method for logical structure analysis, as text blocks would not fit into LayoutLM's context.…”
Section: Related Work (mentioning)
Confidence: 99%
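The context limit noted in this citation is concrete: LayoutLM-style encoders accept at most 512 subword tokens, so long documents must be windowed. Below is a minimal sketch of one common workaround, overlapping sliding windows over the OCR word sequence; the window and stride sizes and the merging strategy are illustrative assumptions, not something prescribed by the cited papers.

```python
# Minimal sketch: split a long document's OCR words (with their boxes)
# into overlapping windows that each fit a 512-token encoder context.
# Word-level budgeting for simplicity; in practice one budgets subword tokens.
def sliding_windows(words, boxes, max_len=512, overlap=128):
    """Yield overlapping (words, boxes) chunks of at most `max_len` items."""
    step = max_len - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield words[start:start + max_len], boxes[start:start + max_len]

# Each window is encoded and classified independently; overlapping token
# predictions can then be merged, e.g. by preferring the window in which
# the token lies farthest from a window boundary.
```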
“…{ "items": [ { "name": "3002-Kyoto Choco Mochi", "count": 2, "priceInfo": { "unitPrice": 14000, "price": 28000 } }, { "name": "1001 -Choco Bun", "count": 1, "priceInfo": { "unitPrice": 22000 "price": 22000 } }, ... ], "total": [ { "menuqty_cnt": 4, "total_price": 50000 } ] } { "words": [ { "id": 1, "bbox": [[360,2048],..., [355,2127]], "text": "3002-Kyoto" }, { "id": 2, "bbox": [[801,2074],..., [801,2139]], "text": "Choco" }, { "id": 3, "bbox": [[1035,2074],..., [1035,2147]], "text": "Mochi" }, { "id": 4, "bbox": [[761,2172],..., [761,2253]], "text": "14.000" }, …, { "id": 22, "bbox": [[1573,3030],..., [1571,3126]], "text": "50.000" } ] } text information as input and perform their own objectives with the OCR-extracted texts. (Katti et al, 2018;Hwang et al, 2019Hwang et al, , 2020Hwang et al, , 2021aSage et al, 2020;Majumder et al, 2020a;Xu et al, 2019Xu et al, , 2021. For example, (Hwang et al, 2019), a currently-deployed document parsing system for business card and receipt images, consists of three separate modules for text detection, text recognition, and parsing (See Figure 2).…”
Section: Document Imagementioning
confidence: 99%
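To make the parsing stage of such a pipeline concrete, here is a small, self-contained sketch that groups an OCR word list like the one in the quoted figure into visual lines before field extraction. The y-tolerance heuristic and the simplified (x, y) top-left boxes are illustrative assumptions, not the cited system's algorithm.

```python
def group_into_lines(words, y_tol=40):
    """Group OCR words into visual lines by the y of their top-left corner."""
    lines = []
    for w in sorted(words, key=lambda w: w["bbox"][1]):
        # Same line if the y-coordinate is close to the line's first word.
        if lines and abs(lines[-1][0]["bbox"][1] - w["bbox"][1]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Order words left-to-right within each line.
    return [sorted(line, key=lambda w: w["bbox"][0]) for line in lines]

# Words adapted from the figure (quad boxes simplified to top-left points).
words = [
    {"bbox": (360, 2048), "text": "3002-Kyoto"},
    {"bbox": (801, 2074), "text": "Choco"},
    {"bbox": (1035, 2074), "text": "Mochi"},
    {"bbox": (761, 2172), "text": "14.000"},
]
for line in group_into_lines(words):
    print(" ".join(w["text"] for w in line))
# prints: "3002-Kyoto Choco Mochi" then "14.000"
```

A real parser would then map such lines to fields (item name, count, unit price, total), which is where layout-aware models like LayoutLM replace hand-written rules.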
“…Earlier attempts in VDU were made with vision-based approaches (Kang et al., 2014; Afzal et al., 2015; Harley et al., 2015a), which showed the importance of textual understanding in VDU (Xu et al., 2019). With the emergence of BERT (Devlin et al., 2018), most state-of-the-art methods (Xu et al., 2019, 2021; Hong et al., 2021) have combined computer vision (CV) and natural language processing (NLP) techniques and shown remarkable advances in recent years.…”
Section: Preliminary: Background (mentioning)
Confidence: 99%