Jérémy Barbay scite author profile

Succinct indexes for strings, binary relations and multilabeled trees

We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. The main advantage of succinct indexes as opposed to succinct (integrated data/index) encodings is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this article, we present succinct indexes for various data types, namely strings, binary relations and multilabeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data.

Using our techniques, we design a succinct encoding that represents a string of length $n$ over an alphabet of size $\sigma$ using $nH_k(S) + \lg\sigma \cdot o(n) + O(n\lg\sigma / \lg\lg\lg\sigma)$ bits to support access/rank/select operations in $o((\lg\lg\sigma)^{1+\epsilon})$ time, for any fixed constant $\epsilon > 0$. We also design a succinct text index using $nH_0(S) + O(n\lg\sigma / \lg\lg\sigma)$ bits that supports finding all the $occ$ occurrences of a given pattern of length $m$ in $O(m\lg\lg\sigma + occ\,\lg n / \lg^{\epsilon}\sigma)$ time, for any fixed constant $0 < \epsilon < 1$. Previous results on these two problems either have a $\lg\sigma$ factor instead of $\lg\lg\sigma$ in the running time, or are not compressed. Finally, we present succinct encodings of binary relations and multilabeled trees that are more compact than previous structures.
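As a point of reference, the three string operations named above can be pinned down with a naive, uncompressed sketch. This is not the succinct index from the article: it runs in linear time and only fixes the semantics that such an index is meant to support. The function names, the 0-based positions, and the convention that rank counts strictly before position i are assumptions made for illustration.

```python
def access(s, i):
    """Return the symbol stored at position i (0-based)."""
    return s[i]


def rank(s, c, i):
    """Count the occurrences of symbol c in the prefix s[0:i]."""
    return s[:i].count(c)


def select(s, c, j):
    """Return the position of the j-th occurrence of c (j >= 1), or -1 if there is none."""
    pos = -1
    for _ in range(j):
        pos = s.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos
```

With these conventions rank and select are inverses of each other: whenever s[i] == c, select(s, c, rank(s, c, i + 1)) returns i.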

Efficient Fully-Compressed Sequence Representations

Barbay et al. 2012

We present a data structure that stores a sequence $s[1..n]$ over alphabet $[1..\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the zero-order entropy of $s$. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time $O(\lg\lg\sigma)$ and average time $O(\lg H_0(s))$. The worst-case complexity matches the best previous results, yet these had been achieved with data structures using $nH_0(s) + o(n\lg\sigma)$ bits. On highly compressible sequences the $o(n\lg\sigma)$ bits of the redundancy may be significant compared to the $nH_0(s)$ bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.

Our technique is based on partitioning the alphabet into groups of characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.

The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations $\pi$ with times for $\pi()$ and $\pi^{-1}()$ improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.

Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
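To make the space bounds concrete, the sketch below computes the zero-order entropy $H_0(s)$ of a sequence and compares the $nH_0(s)$ term with the uncompressed $n\lg\sigma$ baseline. It only evaluates the formulas and does not build any of the data structures described; the function names are hypothetical, and $\sigma$ is taken here to be the number of distinct symbols that actually occur.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """H0(s) in bits per symbol: sum over symbols c of (n_c / n) * lg(n / n_c)."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def space_estimates(s):
    """Return (n * H0(s), n * lg(sigma)), both in bits."""
    n = len(s)
    sigma = len(set(s))  # assumption: alphabet = symbols that occur in s
    plain = n * math.log2(sigma) if sigma > 0 else 0.0
    return n * zero_order_entropy(s), plain
```

For the skewed sequence "a" * 15 + "b", the plain encoding needs 16 bits while $nH_0(s)$ is roughly 5.4 bits, which is the kind of gap where an $o(n\lg\sigma)$ redundancy term stops being negligible.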

Alphabet Partitioning for Compressed Rank/Select and Applications

Barbay, Gagie, Navarro et al. 2010

We present a data structure that stores a string $s[1..n]$ over the alphabet $[1..\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the zero-order entropy of $s$. This data structure supports the queries access and rank in time $O(\lg\lg\sigma)$, and the select query in constant time. This result improves on previously known data structures using $nH_0(s) + o(n\lg\sigma)$ bits, where on highly compressible instances the redundancy $o(n\lg\sigma)$ ceases to be negligible compared to the $nH_0(s)$ bits that encode the data. The technique is based on combining previous results through an ingenious partitioning of the alphabet, and is practical enough to be implementable. It applies not only to strings, but also to several other compact data structures. For example, we achieve (i) faster search times and lower redundancy for the smallest existing full-text self-index; (ii) compressed permutations $\pi$ with times for $\pi()$ and $\pi^{-1}()$ improved to log-logarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets.
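The partitioning idea can be illustrated with a small sketch: symbols of similar frequency are grouped into a class, the string is split into a sequence of class identifiers plus one subsequence per class, and access is answered by combining a rank on the class sequence with a lookup in the right subsequence. The class rule used here, floor(lg(n / frequency)), is a simplifying assumption and not necessarily the paper's exact parameters, and plain Python lists stand in for the compressed rank/select structures. All names are hypothetical.

```python
import math
from collections import Counter


def partition(s):
    """Split s into a class sequence t and one subsequence per frequency class."""
    n = len(s)
    freq = Counter(s)
    # Assumed class rule: symbols whose frequencies differ by less than a
    # factor of two (roughly) land in the same class.
    cls = {a: math.floor(math.log2(n / f)) for a, f in freq.items()}
    members = {}  # class id -> sorted list of the symbols it contains
    for a in sorted(freq):
        members.setdefault(cls[a], []).append(a)
    local = {a: members[cls[a]].index(a) for a in freq}
    t = [cls[a] for a in s]  # class of every position
    subs = {k: [] for k in members}  # per-class sequence of local symbol ids
    for a in s:
        subs[cls[a]].append(local[a])
    return t, subs, members


def access_partitioned(t, subs, members, i):
    """Recover s[i] from the partitioned representation."""
    k = t[i]
    j = t[:i].count(k)  # naive rank of class k in t before position i
    return members[k][subs[k][j]]
```

On "abracadabra" this rule yields three classes, {a}, {b, r} and {c, d}, so each subsequence is drawn from a small alphabet of near-uniform symbols, which is why a fast uncompressed encoding of the subsequences costs little.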

Instance-Optimal Geometric Algorithms

Afshani, Barbay, Chan 2017, J. ACM

We prove the existence of an algorithm $A$ for computing 2D or 3D convex hulls that is optimal for every point set in the following sense: for every sequence $\sigma$ of $n$ points and for every algorithm $A'$ in a certain class $\mathcal{A}$, the running time of $A$ on input $\sigma$ is at most a constant factor times the running time of $A'$ on the worst possible permutation of $\sigma$ for $A'$. In fact, we can establish a stronger property: for every sequence $\sigma$ of points and every algorithm $A'$, the running time of $A$ on $\sigma$ is at most a constant factor times the average running time of $A'$ over all permutations of $\sigma$. We call algorithms satisfying these properties instance optimal in the order-oblivious and random-order settings, respectively. Such instance-optimal algorithms simultaneously subsume output-sensitive algorithms and distribution-dependent average-case algorithms, and all algorithms that do not take advantage of the order of the input or that assume the input is given in a random order. The class $\mathcal{A}$ under consideration consists of all algorithms in a decision tree model where the tests involve only multilinear functions with a constant number of arguments. To establish an instance-specific lower bound, we deviate from traditional Ben-Or-style proofs and adopt a new adversary argument. For 2D convex hulls, we prove that a version of the well-known algorithm by Kirkpatrick and Seidel [1986] or Chan, Snoeyink, and Yap [1995] already attains this lower bound. For 3D convex hulls, we propose a new algorithm. We further obtain instance-optimal results for a few other standard problems in computational geometry, such as maxima in 2D and 3D, orthogonal line segment intersection in 2D, finding bichromatic $L_\infty$-close pairs in 2D, offline orthogonal range searching in 2D, offline dominance reporting in 2D and 3D, offline half-space range reporting in 2D and 3D, and offline point location in 2D. Our framework also reveals a connection to distribution-sensitive data structures and yields new results as a byproduct, for example, on online orthogonal range searching in 2D and online half-space range reporting in 2D and 3D.
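For readers unfamiliar with the maxima problem mentioned among the applications, the sketch below states it in code using the textbook sort-and-sweep approach. This is explicitly not the instance-optimal algorithm of the paper: it always spends O(n log n) time on sorting, regardless of how easy the instance is. It only fixes what is being computed. The function name and the tuple representation of points are assumptions.

```python
def maxima_2d(points):
    """Return the points that no other point dominates.

    A point p dominates q when p != q, p.x >= q.x and p.y >= q.y.
    """
    result = []
    best_y = float("-inf")
    # Sweep from right to left; among equal x, visit the larger y first,
    # so every previously visited point has x at least as large.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:  # nothing to the right is at least as high
            result.append((x, y))
            best_y = y
    return result
```

The output can be far smaller than the input (a single point in the extreme case), which is the kind of instance structure that output-sensitive and instance-optimal algorithms exploit.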

An experimental investigation of set intersection algorithms for text searching

Barbay, López-Ortiz et al. 2009, ACM J. Exp. Algorithmics

The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies by Demaine, López-Ortiz and Munro [ALENEX 2001] and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google, and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective contexts (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that value-based search algorithms perform well on posting lists in terms of the number of comparisons performed.
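The setting studied above, intersecting sorted arrays by repeatedly searching one array for an element of the other, can be sketched with doubling (galloping) search, a standard search strategy for this problem. This is a minimal illustration under the assumption of two sorted lists of distinct values; it is not a reproduction of any specific algorithm evaluated in the paper, and the function names are hypothetical.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Return the smallest index i >= lo with arr[i] >= target (len(arr) if none)."""
    step = 1
    hi = lo
    # Probe at doubling distances until we reach target or run off the array.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer lies in [lo, hi]; finish with a binary search on that window.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect(a, b):
    """Intersect two sorted lists of distinct values."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = gallop(a, b[j], i)  # skip everything in a below b[j]
        else:
            j = gallop(b, a[i], j)  # skip everything in b below a[i]
    return out
```

The cost of each search grows with the logarithm of the distance skipped rather than with the array length, so long runs of non-matching postings are passed over with few comparisons.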

Jérémy Barbay

Succinct indexes for strings, binary relations and multilabeled trees

Efficient Fully-Compressed Sequence Representations

Alphabet Partitioning for Compressed Rank/Select and Applications

Instance-Optimal Geometric Algorithms

An experimental investigation of set intersection algorithms for text searching
