Harvesting relational tables from lists on the web

Elmeleegy, Hazem; Madhavan, Jayant; Halevy, Alon

doi:10.14778/1687627.1687749

Cited by 72 publications

(43 citation statements)

References 26 publications

(39 reference statements)

“…In recent years, harvesting knowledge from the web [11,24,25,28] has attracted more and more attention. For example, Google's Freebase [1] has collected and published more than 39 million real world entities, with more than 140,000 attributes.…”

Section: Proceedings of the VLDB (mentioning)

confidence: 99%

An efficient publish/subscribe index for e-commerce databases

2014

Many of today's publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse database that existing pub/sub systems can no longer support effectively. In this paper, we propose an efficient in-memory index that is scalable to the volume and update of subscriptions, the arrival rate of events and the variety of subscribable attributes. The index is also extensible to support complex scenarios such as prefix/suffix filtering and regular expression matching. We conduct extensive experiments on synthetic datasets and two real datasets (AOL query log and Ebay products). The results demonstrate the superiority of our index over state-of-the-art methods: our index incurs orders of magnitude less index construction time, consumes a small amount of memory and performs event matching efficiently.

“…One distinguishable feature of our work is the ability to gather and leverage domain knowledge at runtime to automatically tune the integration process. The massive exploitation of the structured Web has been studied for data published in HTML tables and lists [10,20]. However, these works focus on the extraction of rich relational schemas, without addressing the issue of integrating the extracted data.…”

Section: Related Work (mentioning)

confidence: 99%

Extraction and integration of partially overlapping web sources

et al. 2013

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy needed by our algorithm to produce a solution, and present experimental results to show the benefits of our approach with respect to existing solutions.

“…Traditional IE techniques considered in the database community tend to be source-centric, i.e., they can only be deployed to extract from a specific website or data source. However, a range of domain-independent techniques have emerged recently [2,4,9,10,11,16,20] that seek to look at extraction holistically on the entire Web.…”

Section: Introduction (mentioning)

confidence: 99%

“…There are some domain-independent efforts, e.g. WebTables [4,10], that extract all simple tables and lists from the Web and store them as relational data. However, domain-independence makes it difficult to attach semantics to the extracted data.…”

Section: Introduction (mentioning)

confidence: 99%

An analysis of structured data on the web

2012

In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of Web-scale extraction, and how structured information is distributed amongst top aggregator websites and tail sites for various interesting domains. We believe this is the first study of its kind, and gives us new insights for information extraction over the Web.
