Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Shareghi, Ehsan; Petri, Matthias; Haffari, Gholamreza; Cohn, Trevor

doi:10.1162/tacl_a_00112

Cited by 27 publications

(20 citation statements)

References 17 publications

“…Once our index is built, creating one with a different context selection criterion is faster than building the competitor from scratch, and it takes half the space required for building our index from scratch. Building the index in [32] takes between 5 and 9 bytes per character, which is comparable to our construction, and between 1.1 and 3.2 microseconds per character, which is faster than or comparable to our non-pruned index.…”

Section: Complexity and Comparison To The Competitors (mentioning)

confidence: 51%

“…Results are in Figure 5. We don't compare scoring time to [32], since the latter supports just one scoring function which is significantly different from the ones we consider.…”

Section: Comparison To the Competitors (mentioning)

confidence: 99%

“…Slower query times could be mitigated in practice by applying optimizations from matching statistics (like those in [9], some of which are implicitly enabled by our pruned topologies already), by precomputing counts that are too expensive to evaluate at query time (as done e.g. in [21,30,31,32]), and by taking advantage of the large number of cores that are standard in current servers. Scoring is indeed embarrassingly parallel in most applications, where the dataset to be queried is a large number of short strings, like sequencing reads or proteins.…”

Section: Speeding Up Scoring (mentioning)

confidence: 99%

“…Alternatively, we could use a multilevel scheme like directly addressable codes [12], which have already been used e.g. by [32] for storing precomputed counts. Specifically, a first array could contain one byte for every maximal repeat in preorder, and the first bit in such byte could mark whether the length of the maximal repeat is longer than 2^7.…”

Section: Speeding Up Scoring (mentioning)

confidence: 99%
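
The multilevel scheme mentioned in the statement above can be illustrated with a minimal two-level sketch. This is not the cited authors' implementation: the function names `encode_lengths` and `decode_length`, the choice of 7 payload bits per byte, and the linear-time rank computation are all assumptions made for illustration. The quoted text only states that a first array holds one byte per maximal repeat and that the first bit of each byte flags long values.

```python
from typing import List, Tuple


def encode_lengths(lengths: List[int]) -> Tuple[bytes, List[int]]:
    """Split each length into one byte in a first array plus optional overflow.

    Assumed layout (hypothetical): the top bit of each byte is a flag, the
    low 7 bits are payload. Values below 2**7 fit entirely in the byte;
    larger values keep their low 7 bits in the byte and push the remaining
    high bits to a second array.
    """
    first = bytearray()
    overflow: List[int] = []
    for n in lengths:
        if n < 0:
            raise ValueError("lengths must be non-negative")
        if n < 128:
            first.append(n)                   # flag bit 0, value fits in 7 bits
        else:
            first.append(0x80 | (n & 0x7F))   # flag bit 1, low 7 bits kept here
            overflow.append(n >> 7)           # remaining high bits
    return bytes(first), overflow


def decode_length(first: bytes, overflow: List[int], i: int) -> int:
    """Recover the i-th length from the two arrays."""
    b = first[i]
    if b & 0x80 == 0:
        return b
    # Position in the overflow array = number of flagged bytes before i.
    # A practical index would answer this with a constant-time rank
    # structure; the linear scan keeps the sketch short.
    rank = sum(1 for x in first[:i] if x & 0x80)
    return (overflow[rank] << 7) | (b & 0x7F)


lengths = [3, 127, 128, 1000, 5]
first, overflow = encode_lengths(lengths)
decoded = [decode_length(first, overflow, i) for i in range(len(lengths))]
```

In this sketch most values cost a single byte, and only the flagged entries pay for a slot in the second array, which is the space trade-off the statement alludes to.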

“…VOMMs and their variants have been applied to metagenomic samples as well. For example, VOMMs have been used to separate the reads of a eukaryotic host from those of an intracellular prokaryotic parasite [24]; to model known genomes in order to estimate, given a metagenomic sample, the genome or taxon a read was sampled from [12]; to define compositional distances between metatranscriptomic samples [32]; and to model the clusters produced by reference-free binning of metagenomic reads [28,56,58].…”

Section: Introduction (mentioning)

confidence: 99%

A framework for space-efficient variable-order Markov models

Cunial

Alanko

Belazzougui

2018

Preprint

Motivation: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. Results: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to 4 times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to 10 times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on repetitive datasets, and making them become up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are 100 times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be trained on bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. Availability and implementation: https://github.com/jnalanko/VOMM

Neural Machine Translation

Koehn

2020

Imagine that you are a translator. You are asked to translate from German to English and you come across the word Sitzpinkler. Its literal meaning is someone who pees sitting down, but its intended meaning is wimp. The implication is that a man who sits down to pee is not a real man. But there is more going on here. This word was popularized on a comedy show that coined several other terms in this fashion. One is Warmduscher, someone who takes a warm shower, or even Frauenversteher, someone who understands women. In fact, a whole fad emerged to come up with new terms like this. All these terms are used as insults, but not as real serious insults. They are used very much in jest, a slight mocking. These terms are also firmly a reflection of the current zeitgeist, when the expectations of what it means to be a man are changing. Using such terms is a light-hearted commentary on this change. It is not really unmanly to sit down to pee, although it is something that women do and hence a man who wants to be a traditional "real" man loses some of his identity this way. As you can see, there is a lot going on here. So, what is a translator going to do? Probably use wimp and move on. This example demonstrates that translation is basically impossible. The meaning of words in a language are tied to their prior use in a specific culture. Four score and seven years is not just any way to say 87 years. And I have a dream implies much more than just announcing a vision of the future. Words carry not only an explicit meaning but also an undercurrent of implications that often does not have any equivalent in another language and another culture.

Preface

2020

Neural Machine Translation
