Recent advances in the WWW, IoT, social networks, e-commerce, etc. have generated large volumes of data. These datasets are mostly high dimensional and sparse. Many fundamental subroutines of common data analytic tasks, such as clustering, classification, ranking, and nearest neighbour search, scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm, BinSketch (Binary Data Sketch), for sparse binary datasets. BinSketch preserves the binary version of the dataset after sketching and maintains estimates for multiple similarity measures, such as Jaccard, Cosine, and Inner-Product similarities and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world datasets. We compare the performance of our algorithm with state-of-the-art algorithms in terms of mean-square error and ranking. Our proposed algorithm offers comparable accuracy while providing a significant speedup in the dimensionality reduction time with respect to the other candidate algorithms. Our proposal is simple and easy to implement, and can therefore be adopted in practice.
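The abstract does not spell out how the sketch is built, so the following is only a minimal illustration of what a binary sketch of a sparse binary vector can look like. It assumes one simple construction (each original coordinate is assigned to a random bucket, and a sketch bit is the OR of the bits in its bucket); the names `make_bucket_map`, `sketch`, `hamming`, and `inner_product`, and all sizes and seeds, are hypothetical.

```python
import random

def make_bucket_map(d, n_buckets, seed=0):
    """Assign each of the d original coordinates to a random sketch bucket."""
    rng = random.Random(seed)
    return [rng.randrange(n_buckets) for _ in range(d)]

def sketch(ones, bucket_map):
    """Sketch bit j is the OR of all original bits mapped to bucket j.

    `ones` is the set of indices holding a 1 (a sparse binary vector);
    the returned sketch uses the same set-of-indices encoding.
    """
    return {bucket_map[i] for i in ones}

def hamming(a, b):
    """Hamming distance between two binary vectors given as index sets."""
    return len(a ^ b)

def inner_product(a, b):
    """Inner product between two binary vectors given as index sets."""
    return len(a & b)

# Assumed toy sizes: 100,000 original dimensions, 2,000 sketch dimensions.
d, n_buckets = 100_000, 2_000
bucket_map = make_bucket_map(d, n_buckets, seed=1)

rng = random.Random(7)
u = set(rng.sample(range(d), 50))
v = set(sorted(u)[:25]) | set(rng.sample(range(d), 25))

su, sv = sketch(u, bucket_map), sketch(v, bucket_map)
print("Hamming distance (original, sketch):", hamming(u, v), hamming(su, sv))
print("Inner product   (original, sketch):", inner_product(u, v), inner_product(su, sv))
```

The raw counts computed on the sketches are only rough proxies, because colliding coordinates merge into one bit. The estimators that the abstract says are maintained for each similarity measure are not reproduced here.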
Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse; consider, for instance, the bag-of-words representation used for text. Many applications, such as clustering, nearest neighbour search, ranking and indexing, require an efficient search for similar data points. Even though there have been significant increases in computational power, a simple brute-force similarity search on such datasets is inefficient and at times impossible. Thus, it is desirable to obtain a compressed representation that preserves the similarity between data points. In this work, we consider the data points as sets and use Jaccard similarity as the similarity measure. Compression techniques are generally evaluated on the following parameters: 1) randomness required for compression, 2) time required for compression, 3) dimension of the data after compression, and 4) space required to store the compressed data. Ideally, the compressed representation should preserve the similarity between each pair of data points while keeping the time and the randomness required for compression as low as possible. Recently, Pratap and Kulkarni [11] suggested a compression technique for high dimensional, sparse, binary data that preserves the inner product and Hamming distance between each pair of data points. In this work, we show that their compression technique also works well for Jaccard similarity. We present a theoretical proof of this and complement it with rigorous experiments on synthetic as well as real-world datasets. We also compare our results with the state-of-the-art "min-wise independent permutation" and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness. Moreover, after compression our representation is in binary form, as opposed to integer form in the case of min-wise permutation, which leads to a significant reduction in search time on the compressed data.
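For readers unfamiliar with the baseline named above, here is a small sketch of Jaccard similarity on sets and of its estimation with min-wise permutations. It is a textbook version under simplifying assumptions (explicit random permutations of a small universe, non-empty sets), not the experimental setup of the paper; the function names, universe size, and number of permutations are hypothetical.

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def minhash_signature(s, perms):
    """For each permutation, keep the smallest permuted index of the set.

    Each signature entry is an integer, which is the contrast the
    abstract draws with a binary compressed representation.
    Assumes `s` is non-empty.
    """
    return [min(p[i] for i in s) for p in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Assumed toy setting: universe of 1,000 items, 200 random permutations.
d, num_perms = 1000, 200
rng = random.Random(42)
perms = []
for _ in range(num_perms):
    p = list(range(d))
    rng.shuffle(p)
    perms.append(p)

a = set(range(0, 60))
b = set(range(30, 90))

exact = jaccard(a, b)  # 30 shared items out of 90 in the union
est = estimate_jaccard(minhash_signature(a, perms), minhash_signature(b, perms))
print("exact:", exact, "estimate:", est)
```

Each permutation here costs a full shuffle of the universe, which gives a feel for why the randomness and time required for compression appear among the evaluation parameters listed above.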
In this work, we present a randomized coreset construction for projective clustering, which involves computing a set of k closest j-dimensional linear (affine) subspaces of a given set of n vectors in d dimensions. Let $A \in \mathbb{R}^{n \times d}$ be an input matrix. An earlier deterministic coreset construction of Feldman et al. [10] relied on computing the SVD of A. The best known algorithms for SVD require $\min\{nd^2, n^2 d\}$ time, which may not be feasible for large values of n and d. We present a coreset construction that projects the matrix A onto orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we achieve a faster algorithm than [10] while maintaining almost the same approximation. Our approach also saves space and exploits the sparsity of the input dataset. Another advantage is that the coreset can be constructed quite efficiently in a streaming setting.
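The abstract does not say how the approximating orthonormal vectors are obtained. As a rough illustration of the projection step only, the sketch below assumes plain subspace (power) iteration on $A^{T}A$ with Gram-Schmidt orthonormalization, which avoids a full SVD; it is not the coreset construction itself, and the helper names and the tiny rank-1 example matrix are hypothetical.

```python
import math
import random

def matvec_At_A(A, v):
    """Return A^T (A v) without forming the d x d matrix A^T A."""
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    out = [0.0] * len(v)
    for row, s in zip(A, Av):
        for k, a in enumerate(row):
            out[k] += a * s
    return out

def orthonormalize(vectors):
    """Modified Gram-Schmidt; drops vectors that become numerically zero."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = sum(x * y for x, y in zip(w, q))
            w = [x - c * y for x, y in zip(w, q)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-12:
            basis.append([x / norm for x in w])
    return basis

def approx_right_singular_vectors(A, r, iters=20, seed=0):
    """Subspace iteration: r orthonormal vectors approximating the top
    right singular vectors of A (an assumed stand-in technique)."""
    rng = random.Random(seed)
    d = len(A[0])
    V = orthonormalize([[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)])
    for _ in range(iters):
        V = orthonormalize([matvec_At_A(A, v) for v in V])
    return V

def project_rows(A, V):
    """Coordinates of every row of A in the basis V (an n x r matrix)."""
    return [[sum(a * x for a, x in zip(row, v)) for v in V] for row in A]

# Rank-1 toy input: every row is a multiple of (3, 4, 0).
A = [[3.0, 4.0, 0.0], [6.0, 8.0, 0.0], [-3.0, -4.0, 0.0]]
V = approx_right_singular_vectors(A, r=1)
P = project_rows(A, V)
print("approximate right singular vector:", V[0])
print("projected rows:", P)
```

On this toy input the single recovered direction is, up to sign, the unit vector along (3, 4, 0), so the projected rows reconstruct A exactly. For general inputs the quality of the approximation depends on the number of iterations and on the gaps between singular values.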