Motivated by storage applications, we study the following data structure problem: An encoder wishes to store a collection of jointly-distributed files X := (X_1, X_2, . . . , X_n) ∼ µ which are correlated (H_µ(X) ≪ ∑_i H_µ(X_i)), using as little (expected) memory as possible, such that each individual file X_i can be recovered quickly with few (ideally constant) memory accesses.

In the case of independent random files, a dramatic result by Pǎtraşcu (FOCS'08) and subsequently by Dodis, Pǎtraşcu and Thorup (STOC'10) shows that it is possible to store X using just a constant number of extra bits beyond the information-theoretic minimum space, while at the same time decoding each X_i in constant time. However, in the (realistic) case where the files are correlated, much weaker results are known, requiring at least Ω(n/poly lg n) extra bits for constant decoding time, even for "simple" joint distributions µ.

We focus on the natural case of compressing Markov chains, i.e., storing a length-n random walk on any (possibly directed) graph G. Denoting by κ(G, n) the number of length-n walks on G, we show that there is a succinct data structure storing a random walk using lg₂ κ(G, n) + O(lg n) bits of space, such that any vertex along the walk can be decoded in O(1) time on a word-RAM. If the graph is strongly connected (e.g., undirected), the space can be improved to lg₂ κ(G, n) + 5 bits, i.e., only 5 extra bits. For the harder task of matching the point-wise optimal space of the walk, i.e., the empirical entropy ∑_{i=1}^{n−1} lg(deg(v_i)), we present a data structure with O(1) extra bits at the price of O(lg n) decoding time, and we show that any improvement on this would lead to an improved solution to the long-standing Dictionary problem. All of our data structures support the online version of the problem with constant update and query time.
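To make the two space benchmarks concrete, here is a small Python sketch (the 4-cycle graph and the particular walk are illustrative assumptions, not taken from the paper). The worst-case benchmark lg₂ κ(G, n) follows from counting walks: entry (u, v) of the (n−1)-th power of the adjacency matrix counts walks of n−1 steps from u to v, so summing all entries gives κ(G, n). The point-wise benchmark ∑_{i=1}^{n−1} lg(deg(v_i)) is computed directly from the out-degrees along a concrete walk.

```python
import math
import numpy as np

# Illustrative graph: the directed 4-cycle closed in both directions
# (i.e., the undirected 4-cycle), so every vertex has out-degree 2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

def num_walks(A, n):
    """kappa(G, n): number of walks visiting n vertices (n-1 edges).
    Entry (u, v) of A^(n-1) counts (n-1)-step walks from u to v,
    so the total is the sum of all entries of A^(n-1)."""
    return int(np.linalg.matrix_power(A, n - 1).sum())

def empirical_entropy_bits(A, walk):
    """Point-wise optimal space of a specific walk v_1..v_n:
    sum over i = 1..n-1 of lg(deg(v_i)), with deg = out-degree.
    The last vertex contributes nothing (no step is taken from it)."""
    return sum(math.log2(int(A[v].sum())) for v in walk[:-1])

n = 5
kappa = num_walks(A, n)            # 4 start vertices * 2^4 continuations = 64
walk = [0, 1, 2, 3, 0]             # one particular length-5 walk on the cycle
print(kappa)                       # 64
print(math.log2(kappa))            # worst-case benchmark: lg kappa = 6.0 bits
print(empirical_entropy_bits(A, walk))  # point-wise benchmark: 4 * lg 2 = 4.0 bits
```

Note that lg₂ κ here exceeds the empirical entropy by lg 4 = 2 bits: κ also accounts for the choice of the starting vertex, while the empirical-entropy formula charges only for the n−1 steps.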