2018
DOI: 10.1007/978-3-319-93040-4_14

Efficient Compression Technique for Sparse Sets

Abstract: Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Needless to say, most of this data is high dimensional and sparse; consider, for instance, the bag-of-words representation used for representing text. Often, an efficient search for similar data points needs to be performed in many applications like clustering, nearest neighbour search, ranking and indexing. Even though there have been significant increases in computational power…
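To make the sparsity concrete: in a bag-of-words representation, each document is a binary vector over the whole vocabulary, with ones only at the indices of the words it actually contains. A minimal sketch, using invented toy documents rather than data from the paper:

```python
# Bag-of-words as sparse sets of vocabulary indices.
# The documents and vocabulary here are made up for illustration.
docs = ["compression of sparse sets",
        "nearest neighbour search over sparse data"]

# Build a vocabulary mapping each distinct word to an index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

# Each document becomes the set of indices of the words it contains;
# in a |vocab|-dimensional binary vector, only these positions are 1.
sparse_sets = [{vocab[w] for w in d.split()} for d in docs]

print(len(vocab), sparse_sets)
```

With a realistic vocabulary of hundreds of thousands of words, each such vector is overwhelmingly zeros, which is exactly the regime the paper targets.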

Cited by 8 publications (10 citation statements); references 6 publications.
“…is that it is impractical for large n or large c, followed by a technical annoyance: some OHE implementations do not preserve the Hamming distances for sparse vectors (see illustration in Figure 1). Hence, this encoding is used in conjunction with problem-specific feature selection or followed by dimensionality reduction from binary to binary vectors [17], [18], [19]. The latter is a viable heuristic that we wanted to improve upon by allowing non-binary compressed vectors (see Appendix A for a quick analysis of OHE followed by a state-of-the-art binary compression).…”
Section: Challenges in the Existing Approaches
confidence: 99%
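As a hedged illustration of the Hamming-distance point in this excerpt: full one-hot encoding scales every categorical Hamming distance by exactly 2, while a reduced (drop-one-category) encoding, sometimes used to keep sparse data compact, does not. The toy encodings below are our own construction, not code from the cited works:

```python
# Full one-hot maps categorical Hamming distance d to bit distance 2d;
# a "drop-one-category" encoding (the frequent category becomes the
# all-zeros block) gives 1 or 2 bits for the same categorical distance.

def one_hot(x, c):
    """Full one-hot: each value in 0..c-1 gets its own bit."""
    return [1 if x[i] == j else 0 for i in range(len(x)) for j in range(c)]

def one_hot_dropped(x, c):
    """Reduced encoding: category 0 is encoded as the all-zeros block."""
    return [1 if x[i] == j else 0 for i in range(len(x)) for j in range(1, c)]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

c = 3
u, v = [0, 1, 2], [1, 1, 2]   # categorical Hamming distance 1
w    = [2, 1, 2]              # also categorical distance 1 from v

# Full OHE scales both distances by exactly 2 ...
assert hamming(one_hot(u, c), one_hot(v, c)) == 2
assert hamming(one_hot(w, c), one_hot(v, c)) == 2
# ... but the reduced encoding gives 1 or 2 for the same distance.
print(hamming(one_hot_dropped(u, c), one_hot_dropped(v, c)))  # 1
print(hamming(one_hot_dropped(w, c), one_hot_dropped(v, c)))  # 2
```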
“…According to [45], [46], bit-string sparsity is beneficial for compression without accuracy degradation. In this study, we employed the binary compression scheme (BCS) introduced in [45], [46] for bit-string compression. Fig.…”
Section: Bit-string Compression
confidence: 99%
“…Our proposed algorithm is very similar in nature to the BCS algorithm [22], [23], which suggests a randomized bucketing algorithm where each index of the input is randomly assigned to one of the O(ψ²) buckets; ψ denotes the sparsity of the dataset. The sketch of an input vector is obtained by computing the parity of the bits that fall in each bucket.…”
Section: B. Related Work
confidence: 99%
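From the description in this excerpt, a BCS-style bucketing sketch can be written in a few lines. The following is a sketch under stated assumptions: the exact bucket count (here b = ψ²) and the random assignment of indices to buckets are placeholders, not the precise construction of [22], [23]:

```python
import random

def make_bcs_sketcher(n, psi, seed=0):
    """BCS-style bucketing, as described in the excerpt: every input
    index 0..n-1 is randomly assigned to one of O(psi^2) buckets,
    fixed once for the whole dataset. (b = psi**2 is our assumption
    for the hidden constant.)"""
    rng = random.Random(seed)
    b = max(1, psi * psi)
    bucket_of = [rng.randrange(b) for _ in range(n)]

    def sketch(x):
        """Compress a binary vector x of length n: the j-th output bit
        is the parity (XOR) of the input bits that fell in bucket j."""
        out = [0] * b
        for i, bit in enumerate(x):
            if bit:
                out[bucket_of[i]] ^= 1
        return out

    return sketch

# Usage: a 1000-dimensional binary vector with at most psi = 5 ones
# is compressed down to psi^2 = 25 bits.
sketch = make_bcs_sketcher(n=1000, psi=5)
x = [0] * 1000
for i in (3, 97, 512, 700, 999):
    x[i] = 1
print(sum(sketch(x)), len(sketch(x)))
```

With only ψ set bits spread over roughly ψ² buckets, collisions within a bucket are rare, so small Hamming distances between inputs translate into proportionally small distances between sketches.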
“…For Cosine Similarity, we compare BinSketch with SimHash [10], CBE [27] (a faster variant of SimHash), and MinHash [26], using DOPH [24] in the algorithm of [26] instead of MinHash. For the Inner Product, BCS [23], Asymmetric MinHash [26], and Asymmetric DOPH (using DOPH [24] in [26]) were the competing algorithms. In all these similarity measures, for sparse binary datasets, our proposed algorithm is faster, while simultaneously offering almost the same performance as the baselines.…”
Section: B. Related Work
confidence: 99%
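For context on the cosine-similarity baseline named here: SimHash [10] signs random projections to produce a bit signature whose agreement rate encodes the angle between vectors. A minimal sketch of the standard construction, our own illustration rather than code from the cited papers:

```python
import numpy as np

def simhash_signature(x, planes):
    """One bit per random hyperplane: 1 if the projection is positive."""
    return (planes @ x) > 0

def estimated_cosine(sig_a, sig_b):
    """Pr[bits agree] = 1 - angle/pi, so invert to recover cos(angle)."""
    agree = np.mean(sig_a == sig_b)
    return np.cos(np.pi * (1.0 - agree))

rng = np.random.default_rng(0)
d, k = 1000, 256                      # input dimension, signature bits
planes = rng.standard_normal((k, d))  # random Gaussian hyperplanes

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)  # a noisy copy of x

true_cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
est_cos = estimated_cosine(simhash_signature(x, planes),
                           simhash_signature(y, planes))
print(round(true_cos, 3), round(est_cos, 3))
```

The longer the signature k, the tighter the estimate; the excerpt's point is that for sparse binary data, BinSketch reaches comparable accuracy with less sketching time than such baselines.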