Matthias Petri scite author profile

From Theory to Practice: Plug and Play with Succinct Data Structures

Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is difficult, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.

Optimized succinct data structures for massive data

Gog

Petri

2013

Softw Pract Exp

Succinct data structures provide the same functionality as their corresponding traditional data structure in compact space. We improve on functions rank and select, which are the basic building blocks of FM-indexes and other succinct data structures. First, we present a cache-optimal, uncompressed bitvector representation that outperforms all existing approaches. Next, we improve, in both space and time, on a recent result by Navarro and Providel on compressed bitvectors. Last, we show techniques to perform rank and select on 64-bit words that are up to three times faster than existing methods. In our experimental evaluation, we first show how our improvements affect cache and runtime performance of both operations on data sets larger than commonly used in the evaluation of succinct data structures. Our experiments show that our improvements to these basic operations significantly improve the runtime performance and compression effectiveness of FM-indexes on small and large data sets. To our knowledge, our improvements result in FM-indexes that are either smaller or faster than all current state-of-the-art implementations. Copyright © 2013 John Wiley & Sons, Ltd.
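For readers unfamiliar with the two operations named in this abstract, the following is a minimal sketch of rank and select over a plain bitvector with sampled prefix counts. It is not the cache-optimal or compressed representation from the paper; the class name `RankSelectBitvector` and the block size `BLOCK` are hypothetical choices made only for illustration.

```python
from bisect import bisect_left


class RankSelectBitvector:
    """Plain bitvector with per-block prefix counts (illustrative only)."""

    BLOCK = 64  # bits per block; an assumed value, not taken from the paper

    def __init__(self, bits):
        self.bits = list(bits)  # each entry must be 0 or 1
        # block_ranks[k] = number of 1s in bits[0 : k * BLOCK]
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Return the number of 1s in bits[0:i]."""
        if not 0 <= i <= len(self.bits):
            raise ValueError("i out of range")
        block, offset = divmod(i, self.BLOCK)
        start = block * self.BLOCK
        # Sampled count for whole blocks, then a short scan inside the block.
        return self.block_ranks[block] + sum(self.bits[start:start + offset])

    def select1(self, j):
        """Return the position of the j-th 1 (j is 1-based)."""
        if not 1 <= j <= self.block_ranks[-1]:
            raise ValueError("j out of range")
        # Last block whose prefix count is still below j.
        block = bisect_left(self.block_ranks, j) - 1
        remaining = j - self.block_ranks[block]
        for pos in range(block * self.BLOCK, len(self.bits)):
            remaining -= self.bits[pos]
            if remaining == 0:
                return pos


bv = RankSelectBitvector([1, 0, 1, 1, 0, 0, 1, 0] * 20)
print(bv.rank1(8), bv.select1(5))
```

The two operations are inverses in the sense that `rank1(select1(j) + 1) == j` for every valid `j`. Practical implementations replace the in-block scan with word-level popcount and broadword tricks.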

Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Shareghi

Petri

Haffari

et al. 2015

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index (a compressed suffix tree) which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and, although slower than leading LM toolkits, it shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.

Index Compression Using Byte-Aligned ANS Coding and Two-Dimensional Contexts

Moffat

Petri

2018

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Shareghi

Petri

Haffari

et al. 2016

TACL

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
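As a rough illustration of the kind of query such an index answers, the sketch below counts m-gram occurrences with a plain, uncompressed suffix array and binary search. It is an assumption-laden toy, not the compressed suffix tree or the Kneser-Ney computation from the paper; the function names `build_suffix_array`, `count_occurrences`, and `mle_probability` are hypothetical, and the probability shown is a plain maximum-likelihood ratio without smoothing.

```python
def build_suffix_array(tokens):
    """Naive construction by sorting all suffixes; fine for a toy corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, sa, pattern):
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left


def mle_probability(tokens, sa, context, word):
    """Unsmoothed estimate count(context + word) / count(context)."""
    denom = count_occurrences(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count_occurrences(tokens, sa, context + [word]) / denom


corpus = "a b a b c a b".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["a", "b"]))
```

Because the counts are computed at query time from the index, no fixed m-gram order has to be chosen in advance, which is the property the abstract refers to as on-the-fly computation.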

