Proceedings of the 16th International Conference on World Wide Web 2007
DOI: 10.1145/1242572.1242583

Towards domain-independent information extraction from web tables

Abstract: Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web brow…

Cited by 178 publications (152 citation statements)
References 39 publications
“…(Indeed, we generated the FOCIH form in Figure 4 with this implemented system.) Other table-interpretation systems (e.g., [13,16,25]) could also be used as front-end processors for generating FOCIH forms. Moreover, tables are not the only front-end structures from which we can derive forms.…”
Section: Further Reduction Of Labor-intensive Tasks
Citation type: mentioning
confidence: 99%
“…These simple assumptions (labels are either the first row or the first column) are easily broken in complex tables. More sophisticated table interpretation techniques have appeared in recent papers [8,9,11]. None of this research makes use of sibling tables, but is complementary to our work and could potentially be used in conjunction with our work in future efforts to improve results for certain cases.…”
Section: Introduction
Citation type: mentioning
confidence: 97%
“…Extraction of data unstructured data is very difficult structured data is in the form of HTML and XML language which contains tag such as <ul>, <li>, <table>. [2] [4] [5] How we know the extracted data from list and tables is valuable or not. The quantity of data available on web is dilated.…”
Section: Introduction
Citation type: mentioning
confidence: 99%