GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over S. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct value estimation techniques to solve this problem and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose OT-Sampling, a sampling approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to explore pruning opportunities. We theoretically analyse the proposed techniques with respect to various accuracy estimators. Our comprehensive experiments on six real datasets verify the effectiveness and efficiency of the proposed techniques.

“…The estimation variance by G-KMV method is smaller than that of simple KMV method under reasonable assumptions as analysed in [31].…”

Section: KMV Synopses (mentioning)

confidence: 92%
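
For context on the quoted claim, a KMV synopsis keeps the k smallest hash values of a set and estimates the number of distinct values from the k-th smallest one. The sketch below is a minimal illustration only: `unit_hash`, `kmv_sketch`, and `kmv_distinct_estimate` are hypothetical names, the SHA-1-based hash is an assumption, and the code shows the plain KMV estimator (k - 1) / h_k rather than the G-KMV variant discussed in the citing paper.

```python
import hashlib


def unit_hash(value: str) -> float:
    """Map a value to a pseudo-random point in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(values, k):
    """Return the k smallest distinct hash values, in ascending order."""
    return sorted({unit_hash(v) for v in values})[:k]


def kmv_distinct_estimate(sketch, k):
    """Estimate the number of distinct values from a KMV sketch."""
    if len(sketch) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(sketch))
    # Standard KMV estimator: (k - 1) divided by the k-th smallest hash.
    return (k - 1) / sketch[k - 1]


# 20,000 values drawn from 5,000 distinct items.
values = [f"item-{i % 5000}" for i in range(20000)]
sketch = kmv_sketch(values, k=256)
estimate = kmv_distinct_estimate(sketch, k=256)
```

With k = 256 the estimate should land near 5,000, with the relative error shrinking roughly as 1/sqrt(k) as the sketch grows.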

Selectivity Estimation on Set Containment Search

Yang, Zhang, et al. 2019

Database Systems for Advanced Applications

Self Cite

“…The estimation variance by G-KMV method is smaller than that of simple KMV method under reasonable assumptions as analyzed in [35].…”

Section: KMV Synopses (mentioning)

confidence: 93%

Selectivity Estimation on Set Containment Search

Yang, Zhang, et al. 2019

Data Sci. Eng.

Self Cite

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over S. We first extend existing distinct value estimation techniques to solve this problem and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose OT-Sampling, a sampling approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to explore pruning opportunities. We also consider weighted set containment selectivity estimation and devise a stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques with respect to various accuracy estimators. Our comprehensive experiments on nine real datasets verify the effectiveness and efficiency of the proposed techniques.

“…Recent works [22,71] incorporate ideas similar to the strategy used in this paper and in KMV sketches family: they use a random hashing function to map join values to the unit range and then select tuples based on some selection strategy. For instance, the strategy adopted by the correlated sampling algorithm [71] is equivalent to the strategy of the G-KMV sketch [77], where tuples are selected if the hashed keys are smaller than a probability threshold. In contrast, Correlation Sketches includes tuples in the sketch up to a fixed number, which avoids assigning too much space to large datasets and leads to more predictable performance for query evaluation.…”

Section: Related Work (mentioning)

confidence: 99%
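
The two selection strategies contrasted in the snippet above can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from either cited paper: `unit_hash`, `select_by_threshold`, and `select_fixed_size` are hypothetical names, and the SHA-1-based hash stands in for whatever random hash function a real system would use. Threshold selection keeps every key whose hash falls below a cutoff, so the sample grows with the dataset; fixed-size selection keeps only the n keys with the smallest hashes.

```python
import hashlib


def unit_hash(key: str) -> float:
    """Map a join key to a pseudo-random point in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def select_by_threshold(keys, tau):
    """Keep every distinct key whose hash is below tau.

    The expected sample size is tau times the number of distinct keys,
    so larger datasets receive proportionally larger sketches.
    """
    return {k for k in set(keys) if unit_hash(k) < tau}


def select_fixed_size(keys, n):
    """Keep the n distinct keys with the smallest hashes.

    The sample never exceeds n keys, regardless of dataset size.
    """
    return set(sorted(set(keys), key=unit_hash)[:n])


small = [f"key-{i}" for i in range(1_000)]
large = [f"key-{i}" for i in range(100_000)]

threshold_small = select_by_threshold(small, tau=0.05)
threshold_large = select_by_threshold(large, tau=0.05)
fixed_small = select_fixed_size(small, n=64)
fixed_large = select_fixed_size(large, n=64)
```

Because both datasets share the same hash function, a key that appears in both is either sampled in both or in neither under the threshold strategy, which is what makes the samples usable for estimating join-related quantities.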

“…Recent research proposes methods that support dataset-oriented queries to retrieve datasets that can be concatenated [56] or joined with a given dataset [20,77,84]. However, neither supports the discovery tasks illustrated in the examples above.…”

Section: Introduction (mentioning)

confidence: 99%

Correlation Sketches for Approximate Join-Correlation Queries

Santos, Bessa, Chirigati, et al. 2021

Proceedings of the 2021 International Conference on Management of Data

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column and a join column from a query table, retrieve the tables in a dataset collection that are joinable with the query table on the join column and that contain a column correlated with the given column. A naïve approach to evaluating these queries, which first finds joinable tables and then explicitly joins them and computes correlations between the given column and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings. CCS Concepts: Information systems → Data management systems.

GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

Cited by 19 publications

References 44 publications
