2019
DOI: 10.1609/aaai.v33i01.33016843

Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

Abstract: We show how the spellings of known words can help us deal with unknown words in open-vocabulary NLP tasks. The method we propose can be used to extend any closed-vocabulary generative model, but in this paper we specifically consider the case of neural language modeling. Our Bayesian generative story combines a standard RNN language model (generating the word tokens in each sentence) with an RNN-based spelling model (generating the letters in each word type). These two RNNs respectively capture sentence structure…
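
The abstract describes a two-level generative setup: a word-level RNN that generates the tokens of each sentence, and a separate character-level RNN that generates the spelling of each word type. The sketch below (PyTorch) is only an illustration of that idea, not the authors' released code; all class names, dimensions, and the training note are assumptions.

```python
# Minimal sketch of the two-level idea from the abstract (illustrative only):
# a word-level RNN scores word tokens in a sentence, while a character-level
# RNN scores the spelling of each word *type*.
import torch
import torch.nn as nn

class WordLevelLM(nn.Module):
    """Standard closed-vocabulary RNN language model over word tokens."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):           # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))
        return self.out(h)                  # logits over the word vocabulary

class SpellingLM(nn.Module):
    """Character-level RNN that scores the spelling of a word type."""
    def __init__(self, n_chars, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)

    def forward(self, char_ids):            # (batch, word_len)
        h, _ = self.rnn(self.embed(char_ids))
        return self.out(h)                  # logits over the character set

# Training would combine the two log-likelihoods: a word-level loss over token
# sequences plus a spelling loss over each distinct word type, which is what
# lets the model assign probability to unseen (open-vocabulary) words.
```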

Cited by 27 publications (11 citation statements). References 19 publications.
“…With various lexers written for the TokenBuffer API, users can also create their own high-speed custom tokenizers with ease. The package also provides a simple reversible tokenizer (Mielke, 2019; Mielke & Eisner, 2018) that works by leaving certain merge symbols as a means to reconstruct tokens into the original string.…”
Section: Discussion
confidence: 99%
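
The statement above refers to a reversible tokenizer that leaves merge symbols behind so the original string can be reconstructed exactly. The toy sketch below illustrates that general idea; the marker character, the regex, and the helper names are assumptions, not the cited package's actual implementation.

```python
# Illustrative sketch of a reversible tokenizer: punctuation is split off,
# but a merge symbol records where no space existed, so detokenization is
# lossless for simple inputs like the example below.
import re

MERGE = "\u2581"  # marker meaning "attach to the previous token with no space"

def tokenize(text):
    tokens = []
    for word in text.split(" "):
        parts = re.findall(r"\w+|[^\w\s]", word)   # split off punctuation
        for i, p in enumerate(parts):
            tokens.append(p if i == 0 else MERGE + p)
    return tokens

def detokenize(tokens):
    out = []
    for t in tokens:
        if t.startswith(MERGE):
            out.append(t[len(MERGE):])             # glue to previous token
        else:
            out.append((" " if out else "") + t)
    return "".join(out)

text = "Hello, world!"
assert detokenize(tokenize(text)) == text          # round-trip is lossless
```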
“…The language-specific value of this random slope parameter is then indicative of a stronger or weaker relationship between H and f for this language. As information-encoding units, we estimate on two levels: on the level of words and, instead of estimating on the level of characters, we tokenize our text into sub-word units by byte pair encoding (BPE) 63,64 , which plays an important role in many state-of-the-art natural language model applications 65,66 and provides strong baseline results on a multilingual corpus 67 . In total, we trained seven different LMs on the data, ranging from very simple n-gram models to state-of-the-art deep neural networks (Table 3).…”
Section: Lines Represent Fitted Values Based on an Ansatz Function Th…
confidence: 99%
“…Some of the contexts are allowed to be non-contiguous in order to capture longer-term dependencies 48 , and CMIX uses long short-term memory 61 (LSTM) trained by backpropagation as a byte-level mixer 59 . In addition, instead of estimating on the level of either characters or words, we tokenize our text into sub-word units by byte pair encoding (BPE) 27,62 , which plays an important role in many state-of-the-art natural language model applications such as GPT-3 63 or SentencePiece 64 and provides strong baseline results on a multilingual corpus 65 .…”
Section: Comparing Complexity Rankings Across Corpora
confidence: 99%
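
Both of the statements above rely on byte pair encoding (BPE) to obtain sub-word units. The toy sketch below shows the core BPE training loop in its simplest form (repeatedly merging the most frequent adjacent symbol pair); the word-frequency table and the number of merges are made up for illustration and do not come from the cited work.

```python
# Toy BPE sketch: learn merges over a tiny word-frequency table whose words
# are given as space-separated symbols ending in the marker </w>.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                      # number of merges chosen arbitrarily
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
print(vocab)                             # sub-word units such as "low</w>" and "est</w>" emerge
```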