Ishan Sohony scite author profile

Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse; consider, for instance, the bag-of-words representation used for text. Often, an efficient search for similar data points needs to be performed in applications like clustering, nearest neighbour search, ranking and indexing. Even though there have been significant increases in computational power, a simple brute-force similarity search on such datasets is inefficient and at times impossible. Thus, it is desirable to get a compressed representation which preserves the similarity between data points. In this work, we consider the data points as sets and use Jaccard similarity as the similarity measure. Compression techniques are generally evaluated on the following parameters: 1) randomness required for compression, 2) time required for compression, 3) dimension of the data after compression, and 4) space required to store the compressed data. Ideally, the compressed representation should preserve the similarity between each pair of data points while keeping the time and the randomness required for compression as low as possible. Recently, Pratap and Kulkarni [11] suggested a technique for compressing high dimensional, sparse, binary data while preserving the inner product and Hamming distance between each pair of data points. In this work, we show that their compression technique also works well for Jaccard similarity. We present a theoretical proof of the same and complement it with rigorous experiments on synthetic as well as real-world datasets. We also compare our results with the state-of-the-art "min-wise independent permutation", and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness. Moreover, after compression our representation is in binary form, as opposed to the integer form produced by min-wise permutation, which leads to a significant reduction in search time on the compressed data.
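For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Jaccard similarity and of a min-wise permutation estimate of it. It is not the compression scheme proposed in the paper, whose details the abstract does not give. The function names, the universe size, and the number of permutations are illustrative assumptions, and the sets are assumed to be non-empty sets of integers drawn from a small universe.

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def make_permutations(universe_size, k, seed=0):
    """Draw k random permutations of {0, ..., universe_size - 1} (illustrative helper)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_signature(items, perms):
    """For each permutation, keep the smallest permuted value over the set.

    Assumes `items` is non-empty; the signature holds integers, one per permutation.
    """
    return [min(perm[x] for x in items) for perm in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Two overlapping sets in a universe of 1000 elements (hypothetical data).
A = set(range(0, 100))
B = set(range(50, 150))
perms = make_permutations(universe_size=1000, k=400)
exact = jaccard(A, B)
approx = estimate_jaccard(minhash_signature(A, perms), minhash_signature(B, perms))
```

Under a uniformly random permutation, the probability that two sets share the same minimum equals their Jaccard similarity, so `approx` should be close to `exact` when enough permutations are used. Each signature entry here is an integer, which is the property the abstract contrasts with a binary compressed representation.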


Efficient Compression Technique for Sparse Sets

Pratap, Sohony, Kulkarni

2017

Preprint



Ishan Sohony

Ensemble learning for credit card fraud detection

Efficient Dimensionality Reduction for Sparse Binary Data

Efficient Compression Technique for Sparse Sets

