Pivot-based metric indexing

Lu, Chen; Gao, Yunjun; Zheng, Baihua; Jensen, Christian S.; Yang, Hanyu; Yang, Keyu

doi:10.14778/3115404.3115411

Cited by 51 publications

(43 citation statements)

References 30 publications


“…Similarity searches are usually modeled after the Metric Spaces Model [Hetland 2009; Chen et al 2017] due to its (i) distance-based semantics [Zezula et al 2010] and (ii) computational-bounded complexity [Hetland 2020]. Formally, a metric space is a pair ⟨O, δ⟩ of a data domain O and a distance function δ, which complies with the following properties for any objects o_q, o_i, o_j ∈ O.…”

Section: Similarity Searching (mentioning)

confidence: 99%
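
The quoted statement is cut off before it lists the properties. As a minimal sketch, assuming it refers to the standard metric axioms (identity, positivity, symmetry, triangle inequality), the following Python checks them on a finite sample. The function names and the sample are hypothetical and do not come from the citing paper; passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def squared_euclidean(a, b):
    """Squared Euclidean distance; known to violate the triangle inequality."""
    return math.dist(a, b) ** 2


def satisfies_metric_axioms(sample, dist, tol=1e-9):
    """Check the standard metric axioms on every pair and triple of `sample`.

    Hypothetical helper for illustration only.
    """
    # Identity: the distance from an object to itself is zero.
    for a in sample:
        if abs(dist(a, a)) > tol:
            return False
    for a, b in itertools.combinations(sample, 2):
        d_ab = dist(a, b)
        # Positivity: distinct objects are a positive distance apart.
        if a != b and d_ab <= tol:
            return False
        # Symmetry: the order of the arguments does not matter.
        if abs(d_ab - dist(b, a)) > tol:
            return False
    # Triangle inequality: a detour through b is never shorter than going direct.
    for a, b, c in itertools.permutations(sample, 3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return False
    return True


sample = [(0.0,), (1.0,), (2.0,)]
euclidean_ok = satisfies_metric_axioms(sample, euclidean)
squared_ok = satisfies_metric_axioms(sample, squared_euclidean)
```

On this sample the squared distance fails because δ(0, 2) = 4 exceeds δ(0, 1) + δ(1, 2) = 2. Pivot-based pruning relies on the triangle inequality, so a function that fails this check cannot be used safely with such an index.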

“…• 291 index-and-search algorithms [Hjaltason and Samet 2003;Chen et al 2017;Hetland 2020], but they may present a semantic drawback for searching dense datasets. For instance, suppose a composer runs a similarity search for the five most similar tunes to the "Beatles Hey Jude" in a social-network repository and retrieves versions and parodies of the same song.…”

Section: Introduction (mentioning)

confidence: 99%


An empirical assessment of quality metrics for diversified similarity searching

Lopes, Santos, Jasbick et al. 2021, jidm

A diversified similarity search retrieves elements that are simultaneously similar to a query object and akin to the different collections within the explored data. While several methods in information retrieval, data clustering, and similarity searching have tackled the problem of adding diversity into result sets, the experimental comparison of their performances is still an open issue mainly because the quality metrics are “borrowed” from those different research areas, bringing their biases alongside. In this manuscript, we investigate a series of such metrics and experimentally discuss their trends and limitations. We conclude diversity is better addressed by a set of measures rather than a single quality index and introduce the concept of Diversity Features Model (DFM), which combines the viewpoints of biased metrics into a multidimensional representation. Experimental evaluations indicate (i) DFM enables comparing different result diversification algorithms by considering multiple criteria, and (ii) the most suitable searching methods for a particular dataset are spotted by combining DFM with ranking aggregation and parallel coordinates maps.


“…As a well-known problem in data mining, the purpose of string similarity search is to find all strings within a given edit distance from the query string in a set of strings [12], [13], [14], [15], [16], [17]. However, most related researches focus on building the index of a fixed size set of strings to improve the performance of query [14], [15], [16], [17]. Only a few works have been done on data stream, and most of them focus on time series [18], [19], and most use the sliding window model [12] which is apparently different from the landmark model used in this paper.…”

Section: A Related Work (mentioning)

confidence: 99%
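
The search problem this statement describes, finding all strings within a given edit distance of a query, can be sketched as a linear scan. This is an illustrative baseline, assuming unit-cost Levenshtein distance; the names `edit_distance` and `range_search` are hypothetical, and the sketch is not the indexing method of any cited work.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def range_search(strings, query, threshold):
    """Return every string whose edit distance to `query` is at most `threshold`."""
    return [s for s in strings if edit_distance(s, query) <= threshold]


matches = range_search(["pivot", "pilot", "metric", "pivots"], "pivot", 1)
```

The scan computes one distance per stored string. Indexes over a fixed string set, as mentioned in the statement, aim to avoid most of those computations.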

A Novel Method to Prevent Misconfigurations of Industrial Automation and Control Systems

Zhang et al. 2021, IEEE Trans. Ind. Inf.

“…Afrati et al [29] proposed multiple algorithms to perform a fuzzy join with Hamming, Edit and Jaccard distance in a single MapReduce stage without filters. Other algorithms [30]- [32] use pivots to split data into disjoint partitions by recursive jobs.…”

Section: Introduction (mentioning)

confidence: 99%
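
The statement does not say how the cited algorithms [30]-[32] choose or apply their pivots. As a generic sketch under that caveat, one common way to split data into disjoint partitions is to assign each object to its nearest pivot. The function names, the Hamming distance and the sample data are assumptions for illustration, and the recursive MapReduce jobs are not modeled.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def partition_by_nearest_pivot(objects, pivots, dist):
    """Assign each object to exactly one pivot: the closest one.

    Ties go to the pivot listed first, so the partitions stay disjoint.
    Hypothetical helper, not the method of the cited works.
    """
    partitions = defaultdict(list)
    for obj in objects:
        nearest = min(pivots, key=lambda p: dist(obj, p))
        partitions[nearest].append(obj)
    return dict(partitions)


objects = ["0000", "0001", "1111", "1110", "0011"]
pivots = ["0000", "1111"]
parts = partition_by_nearest_pivot(objects, pivots, hamming)
```

Here "0011" is equally far from both pivots and lands in the partition of "0000", the first pivot in the list. A real similarity join would also have to handle pairs that straddle a partition boundary, which this sketch ignores.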

Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce

Tran, Phan, Laurent et al. 2020, 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)

A fuzzy or similarity join is one of the most useful data processing and analysis operations for Big Data in a general context. It combines pairs of tuples for which the distance is lower than or equal to a given threshold ε. The fuzzy join is used in many practical applications, but it is extremely costly in time and space, and may even not be executed on large-scale datasets. Although there have been some studies to improve its performance by applying filters, a solution of an effective fuzzy filter for the join has never been conducted. In this paper, we thus extend our previous work by proposing a novel fuzzy filter to optimize fuzzy joins. This filter is a compact, probabilistic data structure that supports very fast similarity queries by maintaining a bit matrix, with small false positive rate and zero false negative rate. We show that our proposal is more efficient than others because of eliminating redundant data, reducing computation cost and avoiding duplicate output.
