2023
DOI: 10.1145/3578707

Impact of Tokenization on Language Models: An Analysis for Turkish

Abstract: Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, that is, their outputs vary from the smallest pieces of characters to the su…
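The abstract contrasts tokenizers whose outputs range from character-level pieces to larger subword units. As a rough illustration only (not taken from the paper), the short Python sketch below shows how two off-the-shelf WordPiece vocabularies segment a single heavily suffixed Turkish word; the checkpoint names (multilingual BERT and the Turkish-specific BERTurk) are illustrative assumptions, not the models evaluated in the study.

# Illustrative sketch, not from the paper: compare how pretrained WordPiece
# vocabularies split one heavily suffixed Turkish word.
from transformers import AutoTokenizer

word = "gözlükçülerimizden"  # roughly "from our opticians", one word built by suffixation

# Checkpoint names are assumptions chosen for illustration (mBERT and BERTurk).
for name in ["bert-base-multilingual-cased", "dbmdz/bert-base-turkish-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(word))

# A multilingual vocabulary typically breaks the word into many short pieces,
# while a Turkish-specific vocabulary tends to recover longer, morpheme-like subwords.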

Cited by 16 publications (2 citation statements)
References 42 publications
“…Ding et al. (2019) and Gowda and May (2020) examine the effect of BPE vocabulary size, and Bogoychev and Chen (2021) experiment with using BPE trained on a different domain, which is therefore suboptimal for the primary one. Tokenization of the training data is well known to affect machine translation and other NLP model performance (Domingo et al., 2023; Toraman et al., 2023; Zouhar et al., 2023).…”
Section: Related Work (arXiv:2401.16055v1 [cs.CL], 29 Jan 2024), citation type: mentioning
confidence: 99%
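The citation statement above points to work on the effect of BPE vocabulary size. As a hedged illustration of that effect only (not code from the cited papers), the sketch below trains byte-pair encodings with two different vocabulary sizes on a toy Turkish corpus using the Hugging Face tokenizers library; the corpus, the probe word, and the vocabulary sizes are made-up examples.

# Illustrative sketch: BPE vocabulary size changes segmentation granularity.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy Turkish corpus; a real study would train on a large corpus.
corpus = [
    "evlerimizden geldik",
    "evde kitap okuyorum",
    "kitaplarımızdan birini evlerine götürdüler",
    "okuldan eve yürüyerek geldiler",
]

def train_bpe(vocab_size: int) -> Tokenizer:
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(corpus, trainer)
    return tokenizer

word = "evlerimizden"  # "from our houses"
for size in (60, 300):
    tokens = train_bpe(size).encode(word).tokens
    print(f"vocab_size={size}: {tokens}")

# A small vocabulary yields near character-level pieces; a larger one merges
# frequent subwords, so the same word is split into fewer, longer tokens.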
“…The novelty of the model has also been augmented by the inclusion of the language model, which aims to enhance its performance. The Zemberek library was chosen because it is frequently preferred for Turkish text preprocessing (Akın, Demir & Doğan, 2012; Kaya, Fidan & Toroslu, 2012; Polat & Oyucu, 2020; Toraman et al., 2023). To the best of our knowledge, the final model specifically developed for the Turkish language within the study’s scope is not documented in existing literature.…”
Section: Introduction, citation type: mentioning
confidence: 99%