2023
DOI: 10.1145/3578707

Impact of Tokenization on Language Models: An Analysis for Turkish

Abstract: Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, that is, their outputs vary from the smallest pieces of characters to the su…
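The abstract contrasts tokenizers whose outputs range from character-level pieces to larger subword units. As a rough illustration only (not taken from the paper), the short Python sketch below shows how two off-the-shelf WordPiece vocabularies segment a single heavily suffixed Turkish word; the checkpoint names (multilingual BERT and the Turkish-specific BERTurk) are illustrative assumptions, not the models evaluated in the study.

# Illustrative sketch, not from the paper: compare how pretrained WordPiece
# vocabularies split one heavily suffixed Turkish word.
from transformers import AutoTokenizer

word = "gözlükçülerimizden"  # roughly "from our opticians", one word built by suffixation

# Checkpoint names are assumptions chosen for illustration (mBERT and BERTurk).
for name in ["bert-base-multilingual-cased", "dbmdz/bert-base-turkish-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(word))

# A multilingual vocabulary typically breaks the word into many short pieces,
# while a Turkish-specific vocabulary tends to recover longer, morpheme-like subwords.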

Cited by 16 publications (2 citation statements)
References 42 publications
“…Ding et al. (2019) and Gowda and May (2020) examine the effect of BPE vocabulary size, and Bogoychev and Chen (2021) experiment with using BPE trained on a different domain, which is therefore suboptimal for the primary one. Tokenization of the training data is well known to affect machine translation and other NLP model performance (Domingo et al., 2023; Toraman et al., 2023; Zouhar et al., 2023).…”
Section: Related Work (arXiv:2401.16055v1 [cs.CL], 29 Jan 2024), citation type: mentioning
confidence: 99%
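The citation statement above points to work on the effect of BPE vocabulary size. As a hedged illustration of that effect only (not code from the cited papers), the sketch below trains byte-pair encodings with two different vocabulary sizes on a toy Turkish corpus using the Hugging Face tokenizers library; the corpus, the probe word, and the vocabulary sizes are made-up examples.

# Illustrative sketch: BPE vocabulary size changes segmentation granularity.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy Turkish corpus; a real study would train on a large corpus.
corpus = [
    "evlerimizden geldik",
    "evde kitap okuyorum",
    "kitaplarımızdan birini evlerine götürdüler",
    "okuldan eve yürüyerek geldiler",
]

def train_bpe(vocab_size: int) -> Tokenizer:
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(corpus, trainer)
    return tokenizer

word = "evlerimizden"  # "from our houses"
for size in (60, 300):
    tokens = train_bpe(size).encode(word).tokens
    print(f"vocab_size={size}: {tokens}")

# A small vocabulary yields near character-level pieces; a larger one merges
# frequent subwords, so the same word is split into fewer, longer tokens.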
“…The novelty of the model has also been augmented by the inclusion of the language model, which aims to enhance its performance. The Zemberek library was chosen because it is frequently preferred for Turkish text preprocessing (Akın, Demir & Doğan, 2012; Kaya, Fidan & Toroslu, 2012; Polat & Oyucu, 2020; Toraman et al., 2023). To the best of our knowledge, the final model specifically developed for the Turkish language within the study’s scope is not documented in existing literature.…”
Section: Introduction, citation type: mentioning
confidence: 99%