Abstract: All Pairs Similarity Search (APSS) is a ubiquitous problem in many data mining applications and involves finding all pairs of records with similarity scores above a specified threshold. In this paper, we introduce the problem of Incremental All Pairs Similarity Search (IAPSS), where APSS is performed multiple times over the same dataset by varying the similarity threshold. To the best of our knowledge, this is the first work that addresses the IAPSS problem. All existing solutions for APSS perform redunda…
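To make the problem statement concrete, here is a minimal brute-force sketch of threshold-based pair search, with the scores reused across several thresholds. It is not the method of the paper, whose algorithm is not described in this excerpt. The sparse-dictionary record format, the choice of cosine similarity, the inclusive threshold, and the names `cosine`, `score_all_pairs`, and `pairs_at_threshold` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_all_pairs(records):
    """Score every pair once and sort by descending similarity."""
    return sorted(
        ((cosine(records[i], records[j]), i, j)
         for i, j in combinations(range(len(records)), 2)),
        reverse=True,
    )


def pairs_at_threshold(scored, threshold):
    """Answer one threshold query from the pre-sorted scores.

    Assumption: a pair qualifies when its score is >= threshold.
    """
    result = []
    for score, i, j in scored:
        if score < threshold:
            break  # scores are sorted, so no later pair can qualify
        result.append((i, j, score))
    return result


records = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"c": 1.0}]
scored = score_all_pairs(records)
for t in (0.9, 0.5):
    print(t, [(i, j) for i, j, _ in pairs_at_threshold(scored, t)])
```

Scoring all pairs up front costs quadratic time and memory, so this only illustrates what "the same search at several thresholds" means; it avoids recomputation between queries but does none of the pruning a practical solution would need.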
“…The above studies focus on finding binary or non-binary pairs with some specific similarity measures above some given thresholds. Recently, Awekar et al [4] studied the problem of searching candidate pairs incrementally for varying similarity thresholds. Xiao et al [35] studied the top-K set similarity joins problem for near duplicate detection, which enumerated all the "necessary" similarity thresholds in the decreasing order until the top-K set had been found.…”
Section: Mining Interesting Patterns (mentioning)
confidence: 99%
“…In other words, in the initial stage, we push P[1,2] and P[2,3] (P[i,j] is the pair of item[i] and item[j], given i≤j) into the top-2 list, and compute their cosine values. Then, in the updating stage, we traverse along the diagonals (denoted by the dash-dotted line) in the sorted item-matrix to check in sequence whether P[3,4], P[4,5], P[5,6], P[4,6], P[3,5], …, P[1,6] can enter the top-2 list, as shown in Fig. 1.…”
Section: 222 (mentioning)
confidence: 99%
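The diagonal traversal in the quoted passage can be sketched as a generator of pair indices. The quote elides the middle of the sequence, so the alternating (zigzag) direction used here is an assumption: it is simply one order that reproduces the listed prefix and the final pair P[1,6] for six items. The function name `diagonal_order` is hypothetical.

```python
def diagonal_order(n):
    """Yield 1-based pairs (i, j) with i < j, one diagonal at a time.

    Diagonal d holds the pairs with j - i == d. Assumption: odd diagonals
    are walked with i increasing, even diagonals with i decreasing.
    """
    for d in range(1, n):
        rows = range(1, n - d + 1)
        if d % 2 == 0:
            rows = reversed(rows)
        for i in rows:
            yield (i, i + d)


order = list(diagonal_order(6))
print(order[:7])   # starts with the pairs named in the quoted passage
print(order[-1])   # the last pair visited
```

For six items the first two pairs correspond to the initial top-2 list in the quote, and the remaining pairs are the candidates checked during the updating stage.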
“…For example, for the sorted item-matrix in Fig. 1, if P[3,4] cannot enter the top-2 list for upper(cos(P[3,4])) ≤ minCos, then all the pairs in the upper right corner of P[3,4] will also fail to enter the list, as shown by the shadowed area in Fig. 1.…”
Section: Theorem 1 Given the Current Top-k List And Its Mincos In A (mentioning)
confidence: 99%
“…The vector of "stage 2" shows the updated values. Next, suppose P[4,7] is the third pair with upper(cos(P[4,7])) ≤ minCos, the boundary vector will be further updated to the one of "stage 3" accordingly. Now, given the asymptotic boundary vector above, we have the following criterion to decide whether an item pair should be pruned or not.…”
Section: Boundary Vector For the Pruning Status (mentioning)
confidence: 99%
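The excerpts do not show how the boundary vector is laid out, so the following is only one plausible encoding of the upper-right-corner pruning rule quoted earlier. It assumes "upper right corner of P[i,j]" means every P[r,c] with r ≤ i and c ≥ j, and it stores, per row, the first pruned column. The names `new_boundary`, `prune_corner`, and `is_pruned` are hypothetical.

```python
def new_boundary(n):
    """boundary[r] is the smallest column c such that every P[r, c'] with
    c' >= c is pruned; n + 1 means row r has no pruned pair yet.
    Index 0 is unused so that rows match the 1-based item numbering."""
    return [n + 1] * (n + 1)


def prune_corner(boundary, i, j):
    """Record that P[i, j] failed its upper-bound test, which (under the
    assumption above) prunes every P[r, c] with r <= i and c >= j."""
    for r in range(1, i + 1):
        boundary[r] = min(boundary[r], j)


def is_pruned(boundary, i, j):
    """Pruning criterion: the pair lies at or beyond its row's boundary."""
    return j >= boundary[i]


boundary = new_boundary(7)
prune_corner(boundary, 3, 4)   # e.g. upper(cos(P[3,4])) <= minCos
prune_corner(boundary, 4, 7)   # e.g. upper(cos(P[4,7])) <= minCos
print(boundary[1:])
print(is_pruned(boundary, 1, 6), is_pruned(boundary, 4, 6))
```

Keeping one integer per row makes the pruning check constant time and avoids materialising the shaded region of the item-matrix explicitly.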
“…6. end For example, in the above case, after the traversal of the third diagonal, since the only one not pruned item pair P[4,7] has cosine upper bound less than minCos, we can safely stop our searching and return the current top-2 list as the final result. And the final boundary vector, i.e., the one of "stage 3", is indicated by the shaded areas of Fig.…”