Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1288

Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Abstract: Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index, a compressed suffix tree, which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and altho…
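The on-the-fly Kneser-Ney computation the abstract describes reduces, at each order, to a handful of pattern queries that a compressed suffix tree answers efficiently: the occurrence count of a pattern and the number of distinct symbols extending it to the left or right. The sketch below is not the authors' implementation; it illustrates interpolated Kneser-Ney with a single discount D over exactly those quantities, with a plain dictionary of n-gram counts standing in for the index and the helper names (`count`, `distinct_left`, `distinct_right`, `distinct_both`) chosen here for illustration.

```python
from collections import defaultdict

D = 0.75  # absolute discount; a common default value

def build_counts(tokens, max_order):
    """Count all n-grams up to max_order (a naive stand-in for the CST)."""
    counts = defaultdict(int)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return dict(counts)

def count(counts, pat):
    return counts.get(tuple(pat), 0)

def distinct_right(counts, pat):
    """N1+(pat .): number of distinct symbols that follow pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 1 and g[:n] == tuple(pat))

def distinct_left(counts, pat):
    """N1+(. pat): number of distinct symbols that precede pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 1 and g[1:] == tuple(pat))

def distinct_both(counts, pat):
    """N1+(. pat .): number of distinct (left, right) symbol pairs around pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 2 and g[1:n + 1] == tuple(pat))

def pkn(counts, word, context, highest=True):
    """Interpolated Kneser-Ney P(word | context), computed on the fly."""
    context = list(context)
    if not context:  # base case: unigram continuation probability
        bigram_types = sum(1 for g in counts if len(g) == 2)
        return distinct_left(counts, [word]) / bigram_types
    if highest:      # highest order uses raw occurrence counts
        num = max(count(counts, context + [word]) - D, 0.0)
        den = count(counts, context)
    else:            # lower orders use continuation (type) counts
        num = max(distinct_left(counts, context + [word]) - D, 0.0)
        den = distinct_both(counts, context)
    if den == 0:     # unseen context: back off entirely
        return pkn(counts, word, context[1:], highest=False)
    gamma = D * distinct_right(counts, context) / den
    return num / den + gamma * pkn(counts, word, context[1:], highest=False)

# Toy usage: a trigram query against a tiny corpus.
tokens = "the cat sat on the mat and the cat lay on the rug".split()
counts = build_counts(tokens, max_order=3)
print(pkn(counts, "sat", ["the", "cat"]))  # P(sat | the cat)
```

In the dictionary stand-in each helper scans every stored n-gram, whereas the point of the compressed suffix tree is that these same counts come from a single traversal of the index, with no Markov-order limit on the pattern length.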

Cited by 11 publications (34 citation statements). References 21 publications.
“…Decreasing τ1 implies increasing the number of contexts. Blue circles: the probabilistic suffix tree code in [5]; orange circles: the SPST code in [31]. Our implementation has a small variance, so it is represented just as a green line (averages) rather than as circles.…”
Section: Smaller Variable-order Markov Models (mentioning)
confidence: 99%
“…This strategy can thus be implemented with maximal repeat pruning as well. To enforce that no suffix of a context is itself a context [31,54,15], we can just unmark all nodes with a marked ancestor in ST. Enforcing that all prefixes of a context must be contexts (e.g. to make the Markov model memoryless: see e.g.…”
Section: Variants and Extensions (mentioning)
confidence: 99%
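The pruning rule quoted above, unmarking every node that has a marked ancestor, amounts to a single top-down pass over the tree. The sketch below shows that pass under an assumed plain-node representation; the `Node` class is hypothetical, not the cited data structure, and a real suffix-tree implementation would walk its own node handles instead.

```python
class Node:
    """Hypothetical tree node; `marked` means 'this node is a context'."""
    def __init__(self, marked=False, children=()):
        self.marked = marked
        self.children = list(children)

def unmark_below_marked(root):
    """Clear the mark of every node that has a marked proper ancestor."""
    stack = [(root, False)]  # (node, does some ancestor carry a mark?)
    while stack:
        node, ancestor_marked = stack.pop()
        if ancestor_marked:
            node.marked = False  # a marked ancestor exists: unmark this node
        for child in node.children:
            # Propagate the flag: true once any ancestor on the path is marked.
            stack.append((child, ancestor_marked or node.marked))
```

The pass visits each node exactly once, so it runs in time linear in the number of tree nodes, which matches the quoted claim that the constraint can be enforced "just" by unmarking.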
“…We also demonstrate the efficacy of our implementation for handling strings on large alphabets (with many millions of distinct symbols), which is important, e.g., for applications in natural language processing [21] and information retrieval [10]. Efficiency for large alphabets has been to date unaddressed by previous studies on EM suffix sorting [6,7,4,12,18,13], in all of which a byte alphabet is assumed.…”
Section: Introduction (mentioning)
confidence: 97%
“…This makes modified Kneser-Ney the de-facto choice for language model toolkits. The following software libraries, widely used in both academia and industry (e.g., Google [5,8] and Facebook [11]), all support modified Kneser-Ney smoothing: KenLM [25], BerkeleyLM [44], RandLM [52], Expgram [57], MSRLM [42], SRILM [51], IRSTLM [21] and the recent approach based on suffix trees by Shareghi et al. [49,50]. For such reasons, Kneser-Ney is the model we consider in this work too and that we review in Section 4.…”
Section: Introduction (mentioning)
confidence: 99%