2017 IEEE International Conference on Data Mining (ICDM) 2017
DOI: 10.1109/icdm.2017.64

HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift

Abstract: Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is a challenging task for streaming data, where the elements of a histogram are observed in a streaming manner. First, the ever-growing cardinality of histogram elements makes any similarity computation inefficient. Second, the concept-drift issue in the data streams also impairs the accurate assessment of the similarity. In this paper, we propose to overcome the above challenges …

Cited by 34 publications (45 citation statements)
References 30 publications

“…However, rather than computing and storing a full k-mer spectrum after reading the sequence data, which is resource intensive (in terms of memory or disk space), we use the recently proposed histosketch data structure to maintain a set of fixed size sketches to approximate the overall k-mer spectrum as it is received from a data stream (Yang et al., 2017). The histosketch has two properties making it suitable for this application, i .…”
Section: Histosketching Microbiome Data
Citation type: mentioning
confidence: 99%
“…We view the k-mer spectrum as a histogram, where k-mers from a microbiome sample are hashed uniformly across N bins and the frequency value of a bin corresponds to observed k-mer frequency. In order to incorporate both the bin and frequency (a weighted set) into the histosketch, we employ Consistent Weighted Sampling (CWS) to generate hash values for each histogram element, which ensures that the computational complexity of hashing is independent of bin frequency (Ioffe, 2010; Yang et al., 2017).…”
Section: Histosketching Microbiome Data
Citation type: mentioning
confidence: 99%
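
The statement above describes two steps: binning k-mers into a histogram of N bins, and hashing the weighted bins with Consistent Weighted Sampling. A minimal Python sketch of those two steps follows, assuming the improved CWS scheme of Ioffe (2010) applied to a complete, static histogram. It is not the streaming HistoSketch update with concept-drift handling from the cited paper, and the function names (`kmer_histogram`, `icws_sketch`, `estimate_similarity`), the MD5-based binning, and the per-bin seeding scheme are all illustrative assumptions.

```python
import hashlib
import math
import random
from collections import Counter


def kmer_histogram(seq, k, num_bins):
    """Count the k-mers of `seq` into `num_bins` bins with a deterministic hash.

    MD5 is an arbitrary choice here; any uniform, repeatable hash would do.
    """
    hist = Counter()
    for i in range(len(seq) - k + 1):
        digest = hashlib.md5(seq[i:i + k].encode()).digest()
        hist[int.from_bytes(digest[:8], "big") % num_bins] += 1
    return hist


def icws_sketch(hist, sketch_size):
    """Fixed-size sketch of a weighted set via improved CWS (Ioffe, 2010).

    `hist` maps bin id -> positive weight. Each sketch slot stores the
    (bin, t) pair that minimises the hash value `a` for that slot.
    """
    if not any(w > 0 for w in hist.values()):
        raise ValueError("histogram must contain at least one positive weight")
    sketch = []
    for j in range(sketch_size):
        best = None
        for b, w in hist.items():
            if w <= 0:
                continue
            # The random draws depend only on (slot, bin), never on the weight,
            # so the same bin gets the same draws in every histogram.
            rng = random.Random(f"{j}:{b}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))
            a = c / (y * math.exp(r))
            if best is None or a < best[0]:
                best = (a, b, t)
        sketch.append((best[1], best[2]))
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of matching slots between two sketches of equal length."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Each bin costs a constant number of operations per sketch slot regardless of its count, which is the property the statement attributes to CWS. Under this scheme, the fraction of matching slots between two sketches estimates the weighted (generalized) Jaccard similarity of the two histograms, so two identical histograms always yield an estimate of 1.0.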