2018
DOI: 10.1007/978-3-319-93040-4_14

Efficient Compression Technique for Sparse Sets

Abstract: Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Needless to say, most of this data is high dimensional and sparse; consider, for instance, the bag-of-words representation used for representing text. Often, an efficient search for similar data points needs to be performed in many applications like clustering, nearest neighbour search, ranking and indexing. Even though there have been significant increases in computational power…
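To make the sparsity concrete: in a bag-of-words representation, each document is a binary vector over the whole vocabulary, with ones only at the indices of the words it actually contains. A minimal sketch, using invented toy documents rather than data from the paper:

```python
# Bag-of-words as sparse sets of vocabulary indices.
# The documents and vocabulary here are made up for illustration.
docs = ["compression of sparse sets",
        "nearest neighbour search over sparse data"]

# Build a vocabulary mapping each distinct word to an index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

# Each document becomes the set of indices of the words it contains;
# in a |vocab|-dimensional binary vector, only these positions are 1.
sparse_sets = [{vocab[w] for w in d.split()} for d in docs]

print(len(vocab), sparse_sets)
```

With a realistic vocabulary of hundreds of thousands of words, each such vector is overwhelmingly zeros, which is exactly the regime the paper targets.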

Cited by 8 publications (10 citation statements); references 6 publications.
“…is that it is impractical for large n or large c, followed by a technical annoyance: some OHE implementations do not preserve the Hamming distances for sparse vectors (see illustration in Figure 1). Hence, this encoding is used in conjunction with problem-specific feature selection or followed by dimensionality reduction from binary to binary vectors [17], [18], [19]. The latter is a viable heuristic that we wanted to improve upon by allowing non-binary compressed vectors (see Appendix A for a quick analysis of OHE followed by a state-of-the-art binary compression).…”
Section: Challenges in the Existing Approaches
confidence: 99%
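As a hedged illustration of the Hamming-distance point in this excerpt: full one-hot encoding scales every categorical Hamming distance by exactly 2, while a reduced (drop-one-category) encoding, sometimes used to keep sparse data compact, does not. The toy encodings below are our own construction, not code from the cited works:

```python
# Full one-hot maps categorical Hamming distance d to bit distance 2d;
# a "drop-one-category" encoding (the frequent category becomes the
# all-zeros block) gives 1 or 2 bits for the same categorical distance.

def one_hot(x, c):
    """Full one-hot: each value in 0..c-1 gets its own bit."""
    return [1 if x[i] == j else 0 for i in range(len(x)) for j in range(c)]

def one_hot_dropped(x, c):
    """Reduced encoding: category 0 is encoded as the all-zeros block."""
    return [1 if x[i] == j else 0 for i in range(len(x)) for j in range(1, c)]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

c = 3
u, v = [0, 1, 2], [1, 1, 2]   # categorical Hamming distance 1
w    = [2, 1, 2]              # also categorical distance 1 from v

# Full OHE scales both distances by exactly 2 ...
assert hamming(one_hot(u, c), one_hot(v, c)) == 2
assert hamming(one_hot(w, c), one_hot(v, c)) == 2
# ... but the reduced encoding gives 1 or 2 for the same distance.
print(hamming(one_hot_dropped(u, c), one_hot_dropped(v, c)))  # 1
print(hamming(one_hot_dropped(w, c), one_hot_dropped(v, c)))  # 2
```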
“…According to [45], [46], bit-string sparsity is beneficial for compression without accuracy degradation. In this study, we employed the binary compression scheme (BCS) introduced in [45], [46] for bit-string compression. Fig.…”
Section: Bit-string Compression
confidence: 99%
“…Our proposed algorithm is very similar in nature to the BCS algorithm [22], [23], which suggests a randomized bucketing algorithm where each index of the input is randomly assigned to one of the O(ψ²) buckets; ψ denotes the sparsity of the dataset. The sketch of an input vector is obtained by computing the parity of the bits that fall in each bucket.…”
Section: B. Related Work
confidence: 99%
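From the description in this excerpt, a BCS-style bucketing sketch can be written in a few lines. The following is a sketch under stated assumptions: the exact bucket count (here b = ψ²) and the random assignment of indices to buckets are placeholders, not the precise construction of [22], [23]:

```python
import random

def make_bcs_sketcher(n, psi, seed=0):
    """BCS-style bucketing, as described in the excerpt: every input
    index 0..n-1 is randomly assigned to one of O(psi^2) buckets,
    fixed once for the whole dataset. (b = psi**2 is our assumption
    for the hidden constant.)"""
    rng = random.Random(seed)
    b = max(1, psi * psi)
    bucket_of = [rng.randrange(b) for _ in range(n)]

    def sketch(x):
        """Compress a binary vector x of length n: the j-th output bit
        is the parity (XOR) of the input bits that fell in bucket j."""
        out = [0] * b
        for i, bit in enumerate(x):
            if bit:
                out[bucket_of[i]] ^= 1
        return out

    return sketch

# Usage: a 1000-dimensional binary vector with at most psi = 5 ones
# is compressed down to psi^2 = 25 bits.
sketch = make_bcs_sketcher(n=1000, psi=5)
x = [0] * 1000
for i in (3, 97, 512, 700, 999):
    x[i] = 1
print(sum(sketch(x)), len(sketch(x)))
```

With only ψ set bits spread over roughly ψ² buckets, collisions within a bucket are rare, so small Hamming distances between inputs translate into proportionally small distances between sketches.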
“…For Cosine Similarity, we compare BinSketch with SimHash [10], CBE [27] (a faster variant of SimHash), and MinHash [26], using DOPH [24] in the algorithm of [26] instead of MinHash. For the Inner Product, BCS [23], Asymmetric MinHash [26], and Asymmetric DOPH (using DOPH [24] in [26]) were the competing algorithms. In all these similarity measures, for sparse binary datasets, our proposed algorithm is faster, while simultaneously offering almost the same performance as the baselines.…”
Section: B. Related Work
confidence: 99%
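For context on the cosine-similarity baseline named here: SimHash [10] signs random projections to produce a bit signature whose agreement rate encodes the angle between vectors. A minimal sketch of the standard construction, our own illustration rather than code from the cited papers:

```python
import numpy as np

def simhash_signature(x, planes):
    """One bit per random hyperplane: 1 if the projection is positive."""
    return (planes @ x) > 0

def estimated_cosine(sig_a, sig_b):
    """Pr[bits agree] = 1 - angle/pi, so invert to recover cos(angle)."""
    agree = np.mean(sig_a == sig_b)
    return np.cos(np.pi * (1.0 - agree))

rng = np.random.default_rng(0)
d, k = 1000, 256                      # input dimension, signature bits
planes = rng.standard_normal((k, d))  # random Gaussian hyperplanes

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)  # a noisy copy of x

true_cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
est_cos = estimated_cosine(simhash_signature(x, planes),
                           simhash_signature(y, planes))
print(round(true_cos, 3), round(est_cos, 3))
```

The longer the signature k, the tighter the estimate; the excerpt's point is that for sparse binary data, BinSketch reaches comparable accuracy with less sketching time than such baselines.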