2011 International Conference on Document Analysis and Recognition 2011
DOI: 10.1109/icdar.2011.57

Data Extraction from Web Tables: The Devil is in the Details

Abstract: We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed ("factored") into existing representations …

Cited by 18 publications (12 citation statements)
References 14 publications (14 reference statements)
“…Previous segmentation methods typically located the boundary between headers and data cells using heuristics based on cell content and appearance for distinguishing headers from data cells and the rest of the table (e.g. table title and footnotes) [4,5,6]. Such methods achieved 80-90% accuracy, but the formatting peculiarities causing the remaining errors vary enough to hamper further progress in this direction [7].…”
Section: Previous Work (mentioning)
confidence: 99%
“…Four critical cells that bound the stub-head and data regions completely define the segmentation [10]: CC1 and CC2 correspond to the top-left and bottom-right cells of the stub head; CC3 and CC4 correspond to the top-left and bottom-right cells of the data-cell region. If the stub head consists of a single cell, as in Fig.…”
Section: Introduction (mentioning)
confidence: 99%
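
The four-critical-cell segmentation described in the statement above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation. It assumes a rectangular grid with merged cells already expanded, zero-based (row, column) coordinates, a column header that sits to the right of the stub head, and a row header that sits below it. The function name `segment_table` and the sample table are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column), zero-based; an assumption of this sketch


def segment_table(grid: List[List[str]], cc1: Cell, cc2: Cell,
                  cc3: Cell, cc4: Cell) -> Dict[str, List[List[str]]]:
    """Split a rectangular grid into four regions from the critical cells.

    cc1/cc2: top-left and bottom-right cells of the stub head.
    cc3/cc4: top-left and bottom-right cells of the data region.
    """
    def block(top_left: Cell, bottom_right: Cell) -> List[List[str]]:
        (r0, c0), (r1, c1) = top_left, bottom_right
        return [row[c0:c1 + 1] for row in grid[r0:r1 + 1]]

    return {
        "stub_head": block(cc1, cc2),
        "data": block(cc3, cc4),
        # Assumed layout: column header shares the stub head's rows
        # and the data region's columns.
        "column_header": block((cc1[0], cc3[1]), (cc2[0], cc4[1])),
        # Assumed layout: row header shares the data region's rows
        # and the stub head's columns.
        "row_header": block((cc3[0], cc1[1]), (cc4[0], cc2[1])),
    }


# Hypothetical table whose stub head is a single cell, so CC1 == CC2.
grid = [
    ["Year", "Sales", "Costs"],
    ["2010", "10", "7"],
    ["2011", "12", "8"],
]
regions = segment_table(grid, cc1=(0, 0), cc2=(0, 0), cc3=(1, 1), cc4=(2, 2))
```

With a single-cell stub head, CC1 and CC2 coincide, so only three distinct cells need to be supplied.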
“…These paths can be factored into canonical expressions to recover the Wang category trees of the headers [2]. With the canonical expression and table's data region indexed by the header paths, we can generate the corresponding relational table and populate it with data [10]. We can then query the .…”
Section: Introduction (mentioning)
confidence: 99%
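
As a rough illustration of indexing a data region by header paths and emitting relational rows, the Python sketch below may help. It does not perform the factoring into canonical expressions or Wang category trees mentioned in the statement. It assumes the header regions are already segmented and that spanning header cells have been repeated across the columns or rows they cover. The names `header_paths` and `to_relational` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Path = Tuple[str, ...]


def header_paths(column_header: Sequence[Sequence[str]],
                 row_header: Sequence[Sequence[str]]) -> Tuple[List[Path], List[Path]]:
    """Build one header path per data column and one per data row."""
    # A column path reads the column header top to bottom for that column.
    col_paths = [tuple(col) for col in zip(*column_header)]
    # A row path reads the row header left to right for that row.
    row_paths = [tuple(row) for row in row_header]
    return col_paths, row_paths


def to_relational(column_header, row_header, data):
    """Emit one (row_path, column_path, value) tuple per data cell."""
    col_paths, row_paths = header_paths(column_header, row_header)
    return [
        (row_path, col_path, data[i][j])
        for i, row_path in enumerate(row_paths)
        for j, col_path in enumerate(col_paths)
    ]


# Hypothetical two-level column header with spanning cells already expanded.
column_header = [
    ["2010", "2010", "2011", "2011"],
    ["Sales", "Costs", "Sales", "Costs"],
]
row_header = [["North"], ["South"]]
data = [
    [10, 7, 12, 8],
    [9, 6, 11, 7],
]
rows = to_relational(column_header, row_header, data)
```

Each resulting tuple pairs a data value with the row path and column path that label it, which is the shape needed to populate a relational table.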
“…The present work is part of the larger TANGO project, Table Analysis for Growing Ontologies [4], where we addressed similar goals of information extraction and aggregation from tables [5,6], attempted to formulate an analytical framework for characterizing tables [7], proposed the notion of header paths [8], and demonstrated an end-to-end table processing pipeline that yielded relational tables and 34,110 subject-predicate-object RDF triples from 200 tables [9].…”
Section: Prior Work (mentioning)
confidence: 99%