2022
DOI: 10.48550/arxiv.2202.13669
Preprint

LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding

Abstract: Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be…

Cited by 5 publications (6 citation statements)
References 20 publications (44 reference statements)

“…LiLT (Wang et al., 2022) is a multimodal model which takes both text and bounding boxes as input. The entire framework is a parallel dual-stream Transformer that concurrently processes two streams of information: one for text and the other for layout.…”
Section: Bibliography Detector
Mentioning confidence: 99%
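
To make the dual-stream idea concrete, below is a minimal PyTorch sketch of one parallel layer in which a text stream and a layout stream run side by side and exchange attention scores. It is an illustrative reconstruction of the description above, not the authors' implementation; the dimensions, names, and the score-sharing step are all assumptions.

```python
import torch
import torch.nn as nn


class DualStreamLayer(nn.Module):
    """One parallel layer: separate text and layout streams that share attention scores."""

    def __init__(self, d_text=768, d_layout=192, n_heads=12):
        super().__init__()
        self.n_heads = n_heads
        self.text_qkv = nn.Linear(d_text, 3 * d_text)
        self.layout_qkv = nn.Linear(d_layout, 3 * d_layout)
        self.text_out = nn.Linear(d_text, d_text)
        self.layout_out = nn.Linear(d_layout, d_layout)

    def _heads(self, x):
        # (batch, seq, 3*d) -> q, k, v, each of shape (batch, heads, seq, d/heads)
        b, s, _ = x.shape
        q, k, v = x.chunk(3, dim=-1)
        return [t.view(b, s, self.n_heads, -1).transpose(1, 2) for t in (q, k, v)]

    def forward(self, text_h, layout_h):
        tq, tk, tv = self._heads(self.text_qkv(text_h))
        lq, lk, lv = self._heads(self.layout_qkv(layout_h))

        # Each stream computes its own attention scores ...
        t_scores = tq @ tk.transpose(-2, -1) / tq.size(-1) ** 0.5
        l_scores = lq @ lk.transpose(-2, -1) / lq.size(-1) ** 0.5

        # ... and additionally sees the other stream's scores, so text and
        # layout are processed concurrently while keeping separate parameters.
        t_attn = torch.softmax(t_scores + l_scores, dim=-1)
        l_attn = torch.softmax(l_scores + t_scores, dim=-1)

        text_ctx = (t_attn @ tv).transpose(1, 2).flatten(2)
        layout_ctx = (l_attn @ lv).transpose(1, 2).flatten(2)
        return self.text_out(text_ctx), self.layout_out(layout_ctx)


# Example: a batch of 2 "documents" with 16 tokens each.
layer = DualStreamLayer()
text_out, layout_out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 192))
```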
“…However, the license of LayoutLMv3 prohibits it from being used in industry. A good alternative for industrial use cases is the Language-independent Layout Transformer (LiLT), a multimodal model that overcomes the language barrier by decoupling and learning layout knowledge from monolingual structured documents before generalizing it to multilingual ones (Wang et al., 2022).…”
Section: Introduction
Mentioning confidence: 99%
“…Then the concatenated feature vectors are fed into a multi-modal Transformer encoder-decoder to generate the bounding boxes, with a [CLS] special token prepended. To fully exploit the bounding box information, we use a layout-enhanced RoBERTa model (Wang, Jin, and Ding 2022) instead of the vanilla RoBERTa, which can output the original language hidden states and layout hidden states separately.…”
Section: Answer Location Module
Mentioning confidence: 99%
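
The step described in this snippet can be sketched as follows: per-token language and layout hidden states are concatenated, a learned [CLS] vector is prepended, and the sequence is passed to a Transformer encoder-decoder. This is a hypothetical illustration of the cited description with invented names and sizes, not the cited authors' code.

```python
import torch
import torch.nn as nn

d_lang, d_layout, d_model = 768, 192, 960  # d_model = d_lang + d_layout (assumed sizes)

cls_embedding = nn.Parameter(torch.zeros(1, 1, d_model))          # learned [CLS] vector
encoder_decoder = nn.Transformer(d_model=d_model, batch_first=True)

def locate(lang_hidden, layout_hidden, box_queries):
    # lang_hidden:   (batch, seq, d_lang)    from the language stream
    # layout_hidden: (batch, seq, d_layout)  from the layout stream
    # box_queries:   (batch, n_boxes, d_model) decoder inputs for box generation
    fused = torch.cat([lang_hidden, layout_hidden], dim=-1)        # per-token concatenation
    cls = cls_embedding.expand(fused.size(0), -1, -1)
    fused = torch.cat([cls, fused], dim=1)                         # prepend [CLS]
    return encoder_decoder(src=fused, tgt=box_queries)             # (batch, n_boxes, d_model)
```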
“…Much of today's Information Extraction (IE) is done using probability-based token-classification models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), LayoutLM (Xu et al., 2020a,b; Huang et al., 2022) or LiLT (Wang et al., 2022). These models pursue the best results by stacking ever more parameters, which comes at the cost of increased computational requirements and training complexity.…”
Section: Introduction
Mentioning confidence: 99%
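
For context, a usage sketch of LiLT as a token-classification model for information extraction is shown below. It assumes LiLT is exposed in the Hugging Face transformers library as LiltForTokenClassification and that the SCUT-DLVCLab/lilt-roberta-en-base checkpoint name is valid; the label count and the example words and boxes are made up, so verify the names before relying on them.

```python
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

ckpt = "SCUT-DLVCLab/lilt-roberta-en-base"  # assumed public checkpoint name
# add_prefix_space=True lets the RoBERTa-style tokenizer accept pre-split words.
tokenizer = AutoTokenizer.from_pretrained(ckpt, add_prefix_space=True)
model = LiltForTokenClassification.from_pretrained(ckpt, num_labels=5)  # label count is arbitrary here

words = ["Invoice", "No.", "12345"]
boxes = [[70, 45, 180, 60], [185, 45, 220, 60], [225, 45, 290, 60]]  # made-up boxes, 0-1000 scale

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Expand word-level boxes to sub-token level; special tokens get a zero box.
word_ids = enc.word_ids()
bbox = [[0, 0, 0, 0] if w is None else boxes[w] for w in word_ids]
enc["bbox"] = torch.tensor([bbox])

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, num_labels)
print(logits.argmax(-1))          # predicted label id per sub-token
```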