Text joins in an RDBMS for web data integration

Gravano, Luis; Ipeirotis, Panagiotis G.; Koudas, Nick; Srivastava, Divesh

doi:10.1145/775152.775166

Cited by 115 publications

(53 citation statements)

References 17 publications

“…K_m represents a possible join query (each relation node in the tree, or connected to a node in the tree by a zero-cost edge, represents a query atom, and each nonzero-cost edge represents a join or selection condition). Like most keyword search-over-database systems, Q generates queries that may produce relevant answers by running an approximate Steiner tree algorithm [43] to connect matching nodes in the search graph with the lowest-cost tree, and executes them and unions their results together in ranked order using a top-k query processing algorithm [16,21,27]. While the Q system combines cost components (features) derived from data as well as metadata, in this paper we focus on features that are associated with the metadata and the query (particularly those having to do with predicted schema matches) rather than those derived from specific fields in the data.…”

Section: Search and Ranking (mentioning)

confidence: 99%

Active learning in keyword search-based data integration

et al. 2015

The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration, where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers' quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few "top-k" results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.

“…This technique is also very useful to data integration applications. A special case is the approximate join operator [17,18,39] which matches records from different files according to the degree of similarity between their fields.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic threshold estimation for data matching applications

Santos, Heuser, Moreira

et al. 2011

Information Sciences
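The approximate join operator described in the citation statement above can be illustrated with a minimal sketch. It assumes q-gram Jaccard similarity and a fixed threshold, which are illustrative choices and not necessarily the measures used in the cited works; the names `qgrams`, `jaccard`, and `approximate_join` are hypothetical.

```python
def qgrams(text, q=3):
    """Return the set of padded, lower-cased q-grams of a string."""
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_join(left, right, key, threshold=0.5):
    """Pair up records whose `key` fields are similar enough.

    Assumption: similarity is q-gram Jaccard and `threshold` is chosen
    by the caller; this is a nested-loop sketch, not an optimized join.
    """
    right_grams = [(row, qgrams(row[key])) for row in right]
    matches = []
    for l_row in left:
        l_grams = qgrams(l_row[key])
        for r_row, r_grams in right_grams:
            score = jaccard(l_grams, r_grams)
            if score >= threshold:
                matches.append((l_row, r_row, score))
    return matches
```

Choosing the threshold is the hard part in practice, which is the problem the citing article's title refers to.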

“…In [9,16,17], they present how to declaratively integrate similarity functions to the DBMS using an SQL interface and perform entity extraction tasks. Unfortunately, no single similarity function can always outperform the others.…”

Section: Related Work (mentioning)

confidence: 99%

Approximate entity extraction in temporal databases

Lü, Fung

et al. 2011

World Wide Web

We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and there exist a variety of methods to define the similarity between a record and the search query. However, in temporal databases, each entity is represented as a sequence of historical records. How to properly define the similarity of each entity in the temporal database still remains an open problem. The main challenge is that, when a user issues a search query for an entity, he or she is prone to mix up information about the same entity at different time points. As a result, methods used in relational databases, which operate at record granularity, no longer work. Instead, we regard each entity as a set of "virtual records", where the attribute values of a "virtual record" can come from different records of the same entity. In this paper, we propose a novel evaluation model, based on which the similarity between each "virtual record" and the query can be effectively quantified, and the maximum similarity of its "virtual records" is taken as the similarity of an entity. Because the number of "virtual records" of each entity is exponentially large, calculating the similarity of the entity is challenging. We therefore further propose a Dominating Tree Algorithm (DTA), which is based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The encouraging results show that our model for defining the similarity between each entity and the search query is effective, and that the proposed DTA improves performance by at least two orders of magnitude compared with the naive approach.
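The "virtual record" idea in the abstract above can be sketched as follows. This is the naive enumeration, not the Dominating Tree Algorithm, and the per-record similarity (fraction of query attributes matched exactly) is an assumption made for illustration; the abstract does not specify the paper's evaluation model. All function names are hypothetical.

```python
from itertools import product


def virtual_records(history):
    """Yield every virtual record of an entity.

    Each attribute value may come from any historical record of the
    same entity, so the count grows exponentially with the attributes.
    """
    attrs = sorted(history[0])
    choices = [sorted({rec[a] for rec in history}) for a in attrs]
    for combo in product(*choices):
        yield dict(zip(attrs, combo))


def record_similarity(record, query):
    """Assumed similarity: fraction of query attributes matched exactly."""
    hits = sum(1 for attr, value in query.items() if record.get(attr) == value)
    return hits / len(query)


def entity_similarity(history, query):
    """Entity similarity is the maximum over its virtual records."""
    return max(record_similarity(v, query) for v in virtual_records(history))
```

For example, with a history of `{"city": "Boston", "employer": "Acme"}` followed by `{"city": "Austin", "employer": "Initech"}`, a query mixing `Boston` with `Initech` matches no single historical record in full, but it does match one virtual record.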
