Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1288

Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Abstract: Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index, a compressed suffix tree, which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and altho…
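The on-the-fly Kneser-Ney computation the abstract describes reduces, at each order, to a handful of pattern queries that a compressed suffix tree answers efficiently: the occurrence count of a pattern and the number of distinct symbols extending it to the left or right. The sketch below is not the authors' implementation; it illustrates interpolated Kneser-Ney with a single discount D over exactly those quantities, with a plain dictionary of n-gram counts standing in for the index and the helper names (`count`, `distinct_left`, `distinct_right`, `distinct_both`) chosen here for illustration.

```python
from collections import defaultdict

D = 0.75  # absolute discount; a common default value

def build_counts(tokens, max_order):
    """Count all n-grams up to max_order (a naive stand-in for the CST)."""
    counts = defaultdict(int)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return dict(counts)

def count(counts, pat):
    return counts.get(tuple(pat), 0)

def distinct_right(counts, pat):
    """N1+(pat .): number of distinct symbols that follow pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 1 and g[:n] == tuple(pat))

def distinct_left(counts, pat):
    """N1+(. pat): number of distinct symbols that precede pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 1 and g[1:] == tuple(pat))

def distinct_both(counts, pat):
    """N1+(. pat .): number of distinct (left, right) symbol pairs around pat."""
    n = len(pat)
    return sum(1 for g in counts if len(g) == n + 2 and g[1:n + 1] == tuple(pat))

def pkn(counts, word, context, highest=True):
    """Interpolated Kneser-Ney P(word | context), computed on the fly."""
    context = list(context)
    if not context:  # base case: unigram continuation probability
        bigram_types = sum(1 for g in counts if len(g) == 2)
        return distinct_left(counts, [word]) / bigram_types
    if highest:      # highest order uses raw occurrence counts
        num = max(count(counts, context + [word]) - D, 0.0)
        den = count(counts, context)
    else:            # lower orders use continuation (type) counts
        num = max(distinct_left(counts, context + [word]) - D, 0.0)
        den = distinct_both(counts, context)
    if den == 0:     # unseen context: back off entirely
        return pkn(counts, word, context[1:], highest=False)
    gamma = D * distinct_right(counts, context) / den
    return num / den + gamma * pkn(counts, word, context[1:], highest=False)

# Toy usage: a trigram query against a tiny corpus.
tokens = "the cat sat on the mat and the cat lay on the rug".split()
counts = build_counts(tokens, max_order=3)
print(pkn(counts, "sat", ["the", "cat"]))  # P(sat | the cat)
```

In the dictionary stand-in each helper scans every stored n-gram, whereas the point of the compressed suffix tree is that these same counts come from a single traversal of the index, with no Markov-order limit on the pattern length.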

Cited by 11 publications (34 citation statements). References 21 publications.
“…Decreasing τ1 implies increasing the number of contexts. Blue circles: the probabilistic suffix tree code in [5]; orange circles: the SPST code in [31]. Our implementation has a small variance, so it is represented just as a green line (averages) rather than as circles.…”
Section: Smaller Variable-order Markov Models (mentioning)
confidence: 99%
“…This strategy can thus be implemented with maximal repeat pruning as well. To enforce that no suffix of a context is itself a context [31,54,15], we can just unmark all nodes with a marked ancestor in ST. Enforcing that all prefixes of a context must be contexts (e.g. to make the Markov model memoryless: see e.g.…”
Section: Variants and Extensions (mentioning)
confidence: 99%
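The pruning rule quoted above, unmarking every node that has a marked ancestor, amounts to a single top-down pass over the tree. The sketch below shows that pass under an assumed plain-node representation; the `Node` class is hypothetical, not the cited data structure, and a real suffix-tree implementation would walk its own node handles instead.

```python
class Node:
    """Hypothetical tree node; `marked` means 'this node is a context'."""
    def __init__(self, marked=False, children=()):
        self.marked = marked
        self.children = list(children)

def unmark_below_marked(root):
    """Clear the mark of every node that has a marked proper ancestor."""
    stack = [(root, False)]  # (node, does some ancestor carry a mark?)
    while stack:
        node, ancestor_marked = stack.pop()
        if ancestor_marked:
            node.marked = False  # a marked ancestor exists: unmark this node
        for child in node.children:
            # Propagate the flag: true once any ancestor on the path is marked.
            stack.append((child, ancestor_marked or node.marked))
```

The pass visits each node exactly once, so it runs in time linear in the number of tree nodes, which matches the quoted claim that the constraint can be enforced "just" by unmarking.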
“…We also demonstrate the efficacy of our implementation for handling strings on large alphabets (with many millions of distinct symbols), which is important, e.g., for applications in natural language processing [21] and information retrieval [10]. Efficiency for large alphabets has been to date unaddressed by previous studies on EM suffix sorting [6,7,4,12,18,13], in all of which a byte alphabet is assumed.…”
Section: Introduction (mentioning)
confidence: 97%
“…This makes modified Kneser-Ney the de-facto choice for language model toolkits. The following software libraries, widely used in both academia and industry (e.g., Google [5,8] and Facebook [11]), all support modified Kneser-Ney smoothing: KenLM [25], BerkeleyLM [44], RandLM [52], Expgram [57], MSRLM [42], SRILM [51], IRSTLM [21] and the recent approach based on suffix trees by Shareghi et al. [49,50]. For such reasons, Kneser-Ney is the model we consider in this work too and that we review in Section 4.…”
Section: Introduction (mentioning)
confidence: 99%