Integrating structured data and text: A relational approach

Grossman, David A.; Frieder, Ophir; Holmes, David; Roberts, David C.

doi:10.1002/(sici)1097-4571(199702)48:2<122::aid-asi3>3.0.co;2-#

“…The idea to use DBMS technology as a building block in an IR system is pursued e.g., in [21], where the authors store inverted lists in a Microsoft SQLServer and use SQL queries for keyword search. Similarly, in [19] IR data is distributed over a PC cluster, and an analysis of the impact of concurrent updates is provided.…”

Section: Related Work (mentioning)

confidence: 99%

Flexible and efficient IR using array databases

Cornacchia¹, Héman², Żukowski³ et al. 2007


The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multidimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), that works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, that provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
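The abstract above names Okapi BM25 as its example model without stating the formula. As a rough illustration only, here is a minimal sketch of one common BM25 formulation in plain Python. It is not SRAM's array query language, and the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized documents against a query with one common BM25 variant.

    k1 and b are assumed default values, not taken from the cited paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, tokenized on whitespace.
docs = [
    "relational databases store inverted lists".split(),
    "array databases map sparse arrays to tables".split(),
    "inverted lists speed up text retrieval".split(),
]
scores = bm25_scores(["inverted", "lists"], docs)
```

In this toy collection the second document contains neither query term and scores zero, while the first outranks the third only because it is shorter than average.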


“…Grossman et al [13] present techniques for representing text documents and their associated term frequencies in relational tables, as well as for mapping boolean and vector-space queries into standard SQL queries. They also use a query-pruning technique, based on word frequencies, to speed up query execution.…”

Section: Related Work (mentioning)

confidence: 99%
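The excerpt above describes storing documents and term frequencies in relational tables and expressing boolean and vector-space queries in SQL. A minimal sketch of that idea, using Python's built-in `sqlite3` module, might look as follows. The table names `postings` and `query`, their columns, the toy rows, and the use of raw term frequencies as weights are assumptions for illustration, not the schema or weighting from the cited paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (document, term) pair: a relational inverted index.
    CREATE TABLE postings (doc_id INTEGER, term TEXT, tf INTEGER);
    -- The query is stored as a relation too, one row per query term.
    CREATE TABLE query (term TEXT, tf INTEGER);
""")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [
        (1, "relational", 2), (1, "text", 1),
        (2, "text", 3), (2, "retrieval", 1),
        (3, "relational", 1), (3, "retrieval", 2), (3, "text", 1),
    ],
)
conn.executemany(
    "INSERT INTO query VALUES (?, ?)",
    [("relational", 1), ("text", 1)],
)

# Boolean AND: keep documents that contain every query term.
boolean_and = [row[0] for row in conn.execute("""
    SELECT p.doc_id
    FROM postings p JOIN query q ON p.term = q.term
    GROUP BY p.doc_id
    HAVING COUNT(DISTINCT p.term) = (SELECT COUNT(*) FROM query)
    ORDER BY p.doc_id
""")]

# Vector-space ranking: dot product of document and query term weights
# (raw term frequencies here, as a simplifying assumption).
ranked = conn.execute("""
    SELECT p.doc_id, SUM(p.tf * q.tf) AS score
    FROM postings p JOIN query q ON p.term = q.term
    GROUP BY p.doc_id
    ORDER BY score DESC, p.doc_id
""").fetchall()
```

Both queries are a join between the query relation and the postings relation followed by a group-by. Only the aggregate differs: a count of matched terms for the boolean case, a weighted sum for the ranked case.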

Text joins in an RDBMS for web data integration

Gravano¹, Ipeirotis², Koudas³ et al. 2003

Proceedings of the Twelfth International Conference on World Wide Web - WWW '03


The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.


“…Furthermore, each tuple in RiWeights consists of a tuple id tid, the actual token (i.e., q-gram in this case), and its associated weight. Then, if C bytes are needed to represent tid and weight, the total size of relation RiWeights will not exceed Given the relations R1Weights and R2Weights, a baseline approach [13,18] to compute R1 ⋈φ R2 is shown in Figure 2. This SQL statement performs the text join by computing the similarity of each pair of tuples and filtering out any pair with similarity less than the similarity threshold φ.…”

Section: Tuple Weight Vectors (mentioning)

confidence: 99%
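Figure 2 of the citing paper is not reproduced on this page. The baseline the excerpt describes, joining two weight relations on their tokens, summing the weight products for each tuple pair, and discarding pairs below the threshold φ, can be sketched as below. The table names `r1_weights` and `r2_weights`, the helper `unit_weights`, the word-level tokens (instead of q-grams), the unit-normalized raw term-frequency weights, the sample strings, and the threshold value are all assumptions made for this illustration.

```python
import math
import sqlite3
from collections import Counter


def unit_weights(text):
    """Map each token to a raw term-frequency weight scaled to unit length."""
    tf = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in tf.values()))
    return {tok: c / norm for tok, c in tf.items()}


# Hypothetical input relations: tuple id -> string attribute.
r1 = {1: "acme corp new york", 2: "globex inc"}
r2 = {10: "acme corporation new york", 20: "initech llc"}

conn = sqlite3.connect(":memory:")
for name, rel in (("r1_weights", r1), ("r2_weights", r2)):
    conn.execute(f"CREATE TABLE {name} (tid INTEGER, token TEXT, weight REAL)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?, ?)",
        [
            (tid, tok, w)
            for tid, text in rel.items()
            for tok, w in unit_weights(text).items()
        ],
    )

phi = 0.5  # assumed similarity threshold

# With unit-length weight vectors, the sum of weight products over shared
# tokens is the cosine similarity of the two tuples.
matches = conn.execute("""
    SELECT a.tid, b.tid, SUM(a.weight * b.weight) AS sim
    FROM r1_weights a JOIN r2_weights b ON a.token = b.token
    GROUP BY a.tid, b.tid
    HAVING SUM(a.weight * b.weight) >= ?
""", (phi,)).fetchall()
```

Tuple pairs that share no token never appear in the join output, which is harmless for any positive threshold because their similarity is zero.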

Text joins in an RDBMS for web data integration

Gravano¹, Ipeirotis², Koudas³ et al. 2003

Proceedings of the Twelfth International Conference on World Wide Web - WWW '03


Integrating structured data and text: A relational approach

Cited by 60 publications

References 12 publications
