2018 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2018.8622338
Efficient Dimensionality Reduction for Sparse Binary Data

Cited by 8 publications (12 citation statements)
References 7 publications
“…Each gene represents a data point, and for every gene the dataset stores the integer-valued read-count of that gene corresponding to each cell; these read-counts form our features. Baseline algorithms: the alternative approaches that we compare against are listed in Table 2, namely Binary Compression Scheme (BCS) [34]*, Hamming LSH (H-LSH) [12]*, Feature Hashing (FH) [41], signed random projection / SimHash (SH) [9], Kendall rank correlation coefficient (KT) [19], Latent Semantic Analysis (LSA) [11], Latent Dirichlet Allocation (LDA) [6], Multiple Correspondence Analysis (MCA) [5], Non-negative Matrix Factorisation (NNMF) [24], Variational auto-encoder (VAE) [21], and vanilla Principal Component Analysis (PCA). (* BCS and H-LSH are applied on a BinEm embedding.)…”
Section: Methods (mentioning)
confidence: 99%
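Of the baselines listed above, signed random projection (SimHash) has a particularly compact description: each binary vector is projected onto a few random Gaussian directions, only the signs are kept, and the fraction of agreeing sign bits estimates the angular similarity of the original vectors. The snippet below is a minimal NumPy illustration written for this summary, not code from the cited papers; the function names, the signature length k, and the toy data are assumptions.

```python
import numpy as np

def simhash_sketch(X, k, seed=0):
    """Signed random projection (SimHash): map each row of the binary matrix
    X (n points x d features) to a k-bit signature given by the signs of k
    random Gaussian projections."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k))  # k random projection directions
    return (X @ R) >= 0                       # boolean (n x k) signature matrix

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing bits; approximates 1 - theta/pi for the angle
    theta between the original vectors."""
    return float(np.mean(sig_a == sig_b))

# Toy usage on three sparse binary rows (hypothetical data).
X = np.array([[1, 0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 1, 1]], dtype=float)
S = simhash_sketch(X, k=64)
print(estimated_similarity(S[0], S[1]))  # high: rows 0 and 1 overlap heavily
print(estimated_similarity(S[0], S[2]))  # low: rows 0 and 2 share no features
```

Methods such as BCS and H-LSH in the same table instead produce binary sketches whose pairwise Hamming distances track those of the original vectors, which is the property discussed in the next citation statement.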
“…BinSketch can be applied to this binary sketch to further compress it into low-dimensional binary vectors; the original pairwise Hamming distances can then be approximated from those vectors. Note that there are other known compression algorithms for binary vectors, such as BCS [34]. However, we prefer to use BinSketch as it offers better theoretical as well as practical guarantees on the quality of its estimation.…”
Section: Related Work (mentioning)
confidence: 99%
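The compression step described in this statement can be sketched with a simplified bucket-and-OR construction: each original coordinate is hashed to one of k output positions, the bits landing in the same bucket are OR-ed, and the Hamming distance between sketches is used as a rough proxy for the original distance. This is only a hedged illustration of the general idea; the estimators in the cited works handle bucket collisions more carefully than this direct comparison, and the function names and parameters below are assumptions.

```python
import numpy as np

def bucket_or_sketch(x, k, seed=0):
    """Compress a d-dimensional binary vector into k bits: hash every
    coordinate to one of k buckets (the same hash for all vectors via a
    fixed seed) and OR together the bits that land in the same bucket.
    Simplified illustration only, not the exact BinSketch estimator."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, k, size=x.shape[0])
    sketch = np.zeros(k, dtype=bool)
    np.logical_or.at(sketch, buckets, x.astype(bool))  # OR bits into buckets
    return sketch

def hamming(a, b):
    """Exact Hamming distance between two equal-length binary vectors."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

# Toy usage: two sparse 1000-dimensional binary vectors that differ in
# exactly 10 coordinates (hypothetical data).
rng = np.random.default_rng(1)
x = (rng.random(1000) < 0.02).astype(int)
y = x.copy()
y[rng.choice(1000, size=10, replace=False)] ^= 1
sx, sy = bucket_or_sketch(x, k=128), bucket_or_sketch(y, k=128)
print(hamming(x, y))    # 10
print(hamming(sx, sy))  # close to 10 when the vectors are sparse and k is large
```

The sketch distance underestimates the true distance whenever differing coordinates collide in the same bucket, which is why the quality of this rough proxy depends on the sparsity of the input and the number of buckets k.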