Abstract: Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of tags. A multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers…
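The shift from tag-based parsing to visual geometry can be illustrated with a minimal sketch (all names here are hypothetical; the paper's actual box model is far richer): cells are represented by their rendered bounding boxes, and row structure is inferred from coordinates rather than from `<tr>`/`<td>` nesting.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A rendered cell: its text plus its visual bounding box (in pixels)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def group_into_rows(boxes, tol=2.0):
    """Group boxes into visual rows by top-edge proximity,
    independent of how the HTML happened to encode the table."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(rows[-1][0].y - box.y) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b.text for b in row] for row in rows]

# Two cells belong to the same row if their boxes align vertically,
# even if one was emitted with <div> markup and the other with <td>.
cells = [
    Box("Model", 0, 0, 60, 20), Box("Price", 70, 0, 60, 20),
    Box("Civic", 0, 22, 60, 20), Box("$21k", 70, 22, 60, 20),
]
print(group_into_rows(cells))  # [['Model', 'Price'], ['Civic', '$21k']]
```

Because grouping is purely geometric, the same code handles tables built from nested `<div>`s, `<td>`s, or any other markup, which is the scaling advantage the abstract alludes to.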
Cited by 178 publications (152 citation statements)
References 39 publications
“…(Indeed, we generated the FOCIH form in Figure 4 with this implemented system.) Other table-interpretation systems (e.g., [13,16,25]) could also be used as front-end processors for generating FOCIH forms. Moreover, tables are not the only front-end structures from which we can derive forms.…”
Section: Further Reduction of Labor-Intensive Tasks
Abstract. Creating an ontology and populating it with data are both labor-intensive tasks requiring a high degree of expertise. Thus, scaling ontology creation and population to the size of the web in an effort to create a web of data (which some see as Web 3.0) is prohibitive. Can we find ways to streamline these tasks and lower the barrier enough to enable Web 3.0? Toward this end we offer a form-based approach to ontology creation that provides a way to create Web 3.0 ontologies without the need for specialized training. And we offer a way to semi-automatically harvest data from the current web of pages for a Web 3.0 ontology. In addition to harvesting information with respect to an ontology, the approach also annotates web pages and links facts in web pages to ontological concepts, resulting in a web of data superimposed over the web of pages. Experience with our prototype system shows that mappings between conceptual-model-based ontologies and forms are sufficient for creating the kind of ontologies needed for Web 3.0, and experiments with our prototype system show that automatic harvesting, automatic annotation, and automatic superimposition of a web of data over a web of pages work well. Keywords: ontology generation from forms, information harvesting from the web, automatic annotation of web pages, web of data, Web 3.0.
“…These simple assumptions (labels are either the first row or the first column) are easily broken in complex tables. More sophisticated table interpretation techniques have appeared in recent papers [8,9,11]. None of this research makes use of sibling tables, but it is complementary to ours and could potentially be used in conjunction with our work in future efforts to improve results for certain cases.…”
Abstract. The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
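The core sibling-page idea can be sketched in a few lines (a simplified illustration under my own assumptions, not the paper's implementation): given same-shaped tables generated from one template, cells whose text is constant across siblings are treated as category labels, while cells that vary are treated as data values.

```python
def split_labels_and_values(sibling_tables):
    """Classify each cell position of same-shaped sibling tables as a
    label (nonvarying across siblings) or a value (varying)."""
    first = sibling_tables[0]
    labels, values = [], []
    for i, row in enumerate(first):
        for j, cell in enumerate(row):
            # Collect the text seen at position (i, j) across all siblings.
            texts = {table[i][j] for table in sibling_tables}
            (labels if len(texts) == 1 else values).append((i, j, cell))
    return labels, values

# Two hidden-web pages generated from the same template:
page1 = [["Make", "Honda"], ["Year", "2004"]]
page2 = [["Make", "Ford"], ["Year", "1999"]]
labels, values = split_labels_and_values([page1, page2])
print(labels)  # [(0, 0, 'Make'), (1, 0, 'Year')]
print(values)  # [(0, 1, 'Honda'), (1, 1, '2004')]
```

The real system additionally generates and adjusts structure patterns as it processes a page sequence; this sketch only shows the nonvarying-vs-varying split that underlies that process.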
“…Extraction of unstructured data is very difficult; structured data is in the form of HTML and XML, which contain tags such as <ul>, <li>, and <table>. [2][4][5] How do we know whether the data extracted from lists and tables is valuable or not? The quantity of data available on the web keeps growing.…”
Finding the proper information on web pages is difficult: much of the available data contains unnecessary content, such as product advertisements and Facebook or Twitter posts, and the data obtained is often not in a structured format. To overcome these problems, we introduce a system that focuses on extracting exact information in top-k list format. List data is a rich source for retrieving information. This paper works on information extraction from top-k web pages, which contain top-k instances for an open-domain knowledge base, for example, "Top 10 IT companies in India". Compared to other structured information on the web, top-k list data is cleaner and ranked, and it has interesting semantics. We propose a system that returns the top-k list directly, in minimal time, when a user enters a search query. Extraction of the top-k list depends on 1) extracting web URLs and their titles, 2) removing dust from the web URLs, and 3) extracting the exact top-k list with an extraction algorithm.
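The pipeline described above can be sketched roughly as follows (the function names and regular expression are illustrative assumptions, not the authors' code): k is parsed from the page title, and candidate lists whose length matches k are kept.

```python
import re

def parse_k(title):
    """Extract k from a title such as 'Top 10 IT companies in India'."""
    m = re.search(r"\btop[\s-]*(\d+)\b", title, re.IGNORECASE)
    return int(m.group(1)) if m else None

def extract_top_k_list(title, candidate_lists):
    """Keep only candidate lists whose item count equals the k in the title."""
    k = parse_k(title)
    return [lst for lst in candidate_lists if k is not None and len(lst) == k]

title = "Top 3 IT companies in India"
candidates = [
    ["Home", "About", "Contact"],   # navigation noise, also of length 3
    ["TCS", "Infosys", "Wipro"],    # the actual top-k list
    ["Ad 1", "Ad 2"],               # wrong length, discarded
]
print(extract_top_k_list(title, candidates))
```

Note that length matching alone keeps the length-3 navigation list too; this is exactly why the paper's "dust removal" step and a real extraction algorithm are needed on top of this sketch.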