Proceedings of the 26th International Conference on World Wide Web 2017
DOI: 10.1145/3038912.3052598

Consistent Weighted Sampling Made More Practical

Abstract: Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets; Improved Consistent Weighted Sampling (ICWS) is considered the state of the art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties th…

Cited by 33 publications
(30 citation statements)
References 19 publications
“…Even though the Jaccard and minHash sketches are regularly used as a measure of the k-mer content similarity in computational biology software, the weighted Jaccard similarity has been heavily studied and used in other contexts, such as large database document classification and retrieval (e.g., Manasse et al, 2010; Shrivastava, 2016; Wu et al, 2017), near duplicate image detection (Chum et al, 2008), duplicate news story detection (Alonso et al, 2013), source code deduplication (Markovtsev and Kant, 2017), time series indexing (Luo and Shrivastava, 2017), hierarchical topic extraction (Gollapudi and Panigrahy, 2006), or malware classification (Drew et al, 2017) and detection (Raff and Nicholas, 2017).…”
Section: Weighted Jaccard and Omh (mentioning)
confidence: 99%
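
For reference, the weighted (generalised) Jaccard similarity named in the statement above is commonly defined as the sum of element-wise minima divided by the sum of element-wise maxima. The following is a minimal sketch under that standard definition; the function name `weighted_jaccard` and the dictionary representation of a weighted set are assumptions made for illustration and are not taken from any of the cited papers.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two weighted sets.

    `a` and `b` map elements to non-negative weights (hypothetical
    representation chosen for this sketch). Missing elements count as 0.
    """
    keys = set(a) | set(b)
    # Numerator: total weight shared by both sets.
    shared = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Denominator: total weight covered by either set.
    total = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0


if __name__ == "__main__":
    print(weighted_jaccard({"a": 2, "b": 1}, {"a": 1, "c": 1}))
```

With all weights equal to 0 or 1, this reduces to the ordinary Jaccard similarity of the two sets.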
“…k-mer frequencies). To overcome this, histosketching employs CWS to account for element frequency and approximate the generalised Jaccard similarity between weighted sets, without splitting each weighted element into sub-elements and computing independent hash values (quantization) (Haveliwala et al, 2000; Manasse et al, 2010; Ioffe, 2010; Wu et al, 2017).…”
Section: Consistent Weighted Sampling (mentioning)
confidence: 99%
“…The sample is uniformly sampled from $\bigcup_k \{k\} \times [0, W_k]$, meaning that the probability of selecting k is proportional to the k-mer frequency, $W_k$, and y is uniformly distributed on $[0, W_k]$. The sample is also consistent as given two weighted sets, W1 and W2, if $\forall k,\ W1_k \le W2_k$, a subelement $(k, y_k)$ is selected from W1 and satisfies $y_k \le W2_k$, then $(k, y_k)$ will also be selected from W2 (Ioffe, 2010; Wu et al, 2017).…”
Section: Consistent Weighted Sampling (mentioning)
confidence: 99%
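
The sampling scheme described in the statement above can be illustrated with a short sketch of Improved Consistent Weighted Sampling in the form usually attributed to Ioffe (2010). This is not the PCWS algorithm of the indexed paper, whose details are not given on this page. The function names, the dictionary input, and the per-element string seeding used to make the random draws repeatable across sets are all assumptions made for illustration.

```python
import math
import random


def icws_sketch(weights, num_hashes=256, seed=0):
    """Return one (element, t) sample per hash for a weighted set.

    `weights` maps elements to positive weights (hypothetical format).
    """
    sketch = []
    for i in range(num_hashes):
        best = None
        for k, w in weights.items():
            if w <= 0:
                continue
            # Same (seed, i, k) always yields the same draws, so two sets
            # sharing element k see identical random variables for it.
            rng = random.Random(f"{seed}:{i}:{k}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))   # sampled height, y <= w
            a = c / (y * math.exp(r))      # score; the smallest wins
            if best is None or a < best[0]:
                best = (a, k, t)
        sketch.append((best[1], best[2]) if best else None)
    return sketch


def estimate_similarity(s1, s2):
    """Fraction of hash positions where both sketches picked the same sample."""
    hits = sum(x is not None and x == y for x, y in zip(s1, s2))
    return hits / len(s1)


if __name__ == "__main__":
    s1 = icws_sketch({"a": 2.0, "b": 1.0}, num_hashes=1024)
    s2 = icws_sketch({"a": 1.0, "c": 1.0}, num_hashes=1024)
    print(estimate_similarity(s1, s2))  # estimates the weighted Jaccard similarity
```

The fraction of matching positions estimates the weighted Jaccard similarity of the two inputs, and the estimate tightens as `num_hashes` grows.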
“…Its basic idea is to maintain a set of compact sketches of the original high dimensional data to efficiently approximate their similarities, such as Jaccard [13], [14], cosine [15], and min-max [20], [16], [28], [8], [29] similarities. These sketches can then enable many applications, particularly for information retrieval systems like image or document search engines [26].…”
Section: Related Work (mentioning)
confidence: 99%