WebSRC: A Dataset for Web-Based Structural Reading Comprehension

Chen, Lu; Chen, Xingyu; Zhao, Zihan; Zhang, Danyang; Ji, Jiabao; Luo, A-Li; Xiong, Yuxuan; Yu, Kai

doi:10.18653/v1/2021.emnlp-main.343

Cited by 26 publications

(25 citation statements)

References 43 publications


“…This section provides discussion that connects WebFormer with previous methods as well as the limitations of our model. If we treat HTML tags as additional text tokens, and combine with the text into a single sequence without the H2H, H2T and T2H attentions, our model architecture degenerates to the sequence modeling approaches [9,51] that serialize the HTML layout. If we further trim the HTML from the sequence, our model is regressed to the sequence model [47] that only uses the text information.…”

Section: Discussion (mentioning)

confidence: 99%

“…Recently, there has been an increasing number of works that develop natural language models with sequence modeling [9,20,26,30,34,61] for web information extraction. Zheng et al [59] develop an end-to-end tagging model utilizing BiLSTM, CRF, and attention mechanism without any dictionary.…”

Section: Related Work, 2.1 Information Extraction (mentioning)

confidence: 99%

“…More recently, several attribute extraction approaches [47,49,53] have been proposed, which treat each field as an attribute of interest and extract its corresponding value from clean object context such as web title. Chen et al [9] formulate the web information extraction problem as structural reading comprehension and build a BERT [15] based model to extract structured fields from the web documents. It is worth mentioning that there are also methods that work on multimodal information extraction [44,45,48,55], which focus on extracting the field information from the visual layout or the rendered HTML of the web documents.…”

Section: Related Work, 2.1 Information Extraction (mentioning)

confidence: 99%

“…Existing sequence modeling methods either directly model the text sequence from web document [26,47] or serialize the HTML with the text in a certain order [9,61] to perform the span based text extraction. In this work, we propose to simultaneously encode the text sequence using the Transformer model and incorporate the HTML layout structure with graph attention.…”

Section: Approach Overview (mentioning)

confidence: 99%
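The citation statements above contrast two input styles for sequence models: serializing the HTML tags together with the text, or using the text alone. A minimal sketch of that distinction follows, using only Python's standard-library `html.parser`. It is not the preprocessing of H-PLM, WebFormer, or any other cited method; the names `TagTextSerializer` and `serialize` and the whitespace tokenization are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagTextSerializer(HTMLParser):
    """Hypothetical helper: flatten HTML into one sequence of (kind, token) pairs."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the opening tag as its own token; attributes are dropped for brevity.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        # Naive whitespace tokenization (an assumption; real systems use subword tokenizers).
        self.tokens.extend(("text", word) for word in data.split())


def serialize(html, keep_tags=True):
    """Return tag+text tokens in document order, or text tokens only if keep_tags is False."""
    parser = TagTextSerializer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return [tok for kind, tok in parser.tokens if keep_tags or kind == "text"]


page = "<div><h1>Acme Phone</h1><span>$199</span></div>"
print(serialize(page))                   # HTML serialized together with the text
print(serialize(page, keep_tags=False))  # text-only sequence
```

With `keep_tags=True` the layout markers stay in the sequence in document order; with `keep_tags=False` the same page reduces to its text tokens only.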

“…SimpDOM [61] treats the problem as DOM tree node tagging task by extracting the features for each text node including XPath, and uses a LSTM to jointly encode with the text features. H-PLM [9] sequentializes the HTML together with the text and builds a sequence model using the pre-training ELECTRA [11] as backbone.…”

Section: Baselines (mentioning)

confidence: 99%


WebFormer: The Web-page Transformer for Structure Information Extraction

Wang, Fang, Ravula et al. 2022

Preprint


Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important research topic which has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to a variety of web layout patterns. Limited work has focused on modeling the web layout for extracting the text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverages the web layout for effective attention weight computation. We conduct an extensive set of experiments on SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.

CCS Concepts: Computing methodologies → Information extraction.

