Data Extraction from Web Tables: The Devil is in the Details

2014 22nd International Conference on Pattern Recognition

2014

Self Cite

Abstract: HTML tables represent a significant fraction of web data. The often complex headers of such tables are determined accurately using their indexing property. Isolated headers are factored to extract hierarchical categories. Web tables are then transformed into a canonical form and imported into a relational database. The proposed processing allows for the formulation of arbitrary SQL queries over the collection of induced relational tables.

Section: Previous Work (mentioning)

confidence: 99%

Transforming Web Tables to a Relational Database

Embley

Seth

2014 22nd International Conference on Pattern Recognition

2014

Self Cite

“…Four critical cells that bound the stub-head and data regions completely define the segmentation [10]: CC1 and CC2 correspond to the top-left and bottom-right cells of the stub head; CC3 and CC4 correspond to the top-left and bottom-right cells of the data-cell region. If the stub head consists of a single cell, as in Fig.…”
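
The role of the four critical cells can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the names `Cell`, `block`, and `segment` and the toy grid are hypothetical, and the sketch assumes the usual layout in which the column header spans the stub-head rows above the data columns and the row header spans the stub-head columns beside the data rows.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Zero-based grid coordinates of a table cell (hypothetical helper)."""
    row: int
    col: int


def block(grid, top_left, bottom_right):
    """Return the rectangular sub-grid bounded by two corner cells, inclusive."""
    return [
        row[top_left.col:bottom_right.col + 1]
        for row in grid[top_left.row:bottom_right.row + 1]
    ]


def segment(grid, cc1, cc2, cc3, cc4):
    """Split a grid into four regions from the critical cells.

    Assumption: the column header occupies the stub-head rows over the data
    columns, and the row header occupies the stub-head columns beside the
    data rows.
    """
    return {
        "stub_head": block(grid, cc1, cc2),
        "column_header": block(grid, Cell(cc1.row, cc3.col), Cell(cc2.row, cc4.col)),
        "row_header": block(grid, Cell(cc3.row, cc1.col), Cell(cc4.row, cc2.col)),
        "data": block(grid, cc3, cc4),
    }


# Toy table with a two-row column header and a one-column row header.
grid = [
    ["Year", "Sales", "Sales"],
    ["",     "East",  "West"],
    ["2012", "10",    "20"],
    ["2013", "30",    "40"],
]
regions = segment(grid, Cell(0, 0), Cell(1, 0), Cell(2, 1), Cell(3, 2))
```

In this toy grid the stub head is the two-cell column under "Year", so CC1 and CC2 differ; for a single-cell stub head they would coincide.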

Section: Introduction (mentioning)

confidence: 99%

“…These paths can be factored into canonical expressions to recover the Wang category trees of the headers [2]. With the canonical expression and table's data region indexed by the header paths, we can generate the corresponding relational table and populate it with data [10]. We can then query the .…”
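
A rough sketch of the last step, populating a relational table from data cells indexed by header paths and then querying it, might look as follows in Python. It skips the factoring into Wang category trees and simply emits one row per data cell. The table name `sales`, its column names, and the sample values are illustrative assumptions, not taken from the cited papers.

```python
import sqlite3

# Hypothetical header paths: each data cell is indexed by one row path
# and one column path.
row_paths = [("2012",), ("2013",)]
col_paths = [("Sales", "East"), ("Sales", "West")]
data = [[10, 20], [30, 40]]


def to_rows(row_paths, col_paths, data):
    """Emit one relational tuple per data cell: row path + column path + value."""
    rows = []
    for i, row_path in enumerate(row_paths):
        for j, col_path in enumerate(col_paths):
            rows.append(row_path + col_path + (data[i][j],))
    return rows


rows = to_rows(row_paths, col_paths, data)

# Load the tuples into an in-memory SQLite database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year TEXT, measure TEXT, region TEXT, value INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# An ordinary SQL query over the induced table.
total_east = conn.execute(
    "SELECT SUM(value) FROM sales WHERE region = 'East'"
).fetchone()[0]
```

Storing one row per data cell keeps the schema independent of how many columns the original web table had, at the cost of a longer table.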

Section: Introduction (mentioning)

confidence: 99%

Segmenting Tables via Indexing of Value Cells by Table Headers

Seth

2013 12th International Conference on Document Analysis and Recognition

2013

Self Cite

Abstract: Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only "logical layout analysis" without resorting to any appearance features or natural language understanding. We start with a CSV

“…The present work is part of the larger TANGO project, Table Analysis for Growing Ontologies [4], where we addressed similar goals of information extraction and aggregation from tables [5,6], attempted to formulate an analytical framework for characterizing tables [7], proposed the notion of header paths [8], and demonstrated an end-to-end table processing pipeline that yielded relational tables and 34,110 subject-predicate-object RDF triples from 200 tables [9].…”

Section: Prior Work (mentioning)

confidence: 99%

VeriClick: an efficient tool for table format verification

Tamhankar

2012

SPIE Proceedings

Self Cite

The essential layout attributes of a visual table can be defined by the location of four critical grid cells. Although these critical cells can often be located by automated analysis, some means of human interaction is necessary for correcting residual errors. VeriClick is a macro-enabled spreadsheet interface that provides ground-truthing, confirmation, correction, and verification functions for CSV tables. All user actions are logged. Experimental results of seven subjects on one hundred tables suggest that VeriClick can provide a ten- to twenty-fold speedup over performing the same functions with standard spreadsheet editing commands.