Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023
DOI: 10.18653/v1/2023.acl-long.284

Tokenization and the Noiseless Channel

Abstract: Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal en…
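The efficiency ratio described in the abstract can be made concrete with a short sketch. The following Python snippet is an illustration of that ratio only, not the paper's released code: it estimates the Shannon entropy of an empirical token distribution and divides by the maximum possible entropy over the observed vocabulary, log2 |V| (the paper also generalizes this idea to Rényi entropy, which is not shown here).

```python
# Minimal sketch (illustrative, assuming the empirical token distribution
# stands in for the paper's exact formulation) of tokenizer "efficiency":
# Shannon entropy of the token distribution divided by its maximum possible
# entropy, log2 of the observed vocabulary size.
from collections import Counter
import math

def shannon_efficiency(tokens):
    """Return H(tokens) / log2(|V|) for the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(counts))  # entropy of a uniform distribution over V
    return entropy / max_entropy if max_entropy > 0 else 0.0

# Toy usage: a more uniform token distribution scores closer to 1.0.
print(shannon_efficiency(["the", "cat", "sat", "on", "the", "mat"]))
```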

Cited by 1 publication (1 citation statement)
References: 34 publications
“…Ding et al (2019) and Gowda and May (2020) examine the effect of BPE vocabulary size, and Bogoychev and Chen (2021) experiment with BPE trained on a different domain, which is therefore suboptimal for the primary one. Tokenization of the training data is well known to affect machine translation and other NLP model performance (Domingo et al, 2023; Toraman et al, 2023; Zouhar et al, 2023).…”
Section: arXiv:2401.16055v1 [cs.CL] 29 Jan 2024, 2 Related Work
confidence: 99%