Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023
DOI: 10.18653/v1/2023.acl-long.284

Tokenization and the Noiseless Channel

Abstract: Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal en…
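The efficiency ratio described in the abstract can be made concrete with a short sketch. The following Python snippet is an illustration of that ratio only, not the paper's released code: it estimates the Shannon entropy of an empirical token distribution and divides by the maximum possible entropy over the observed vocabulary, log2 |V| (the paper also generalizes this idea to Rényi entropy, which is not shown here).

```python
# Minimal sketch (illustrative, assuming the empirical token distribution
# stands in for the paper's exact formulation) of tokenizer "efficiency":
# Shannon entropy of the token distribution divided by its maximum possible
# entropy, log2 of the observed vocabulary size.
from collections import Counter
import math

def shannon_efficiency(tokens):
    """Return H(tokens) / log2(|V|) for the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(counts))  # entropy of a uniform distribution over V
    return entropy / max_entropy if max_entropy > 0 else 0.0

# Toy usage: a more uniform token distribution scores closer to 1.0.
print(shannon_efficiency(["the", "cat", "sat", "on", "the", "mat"]))
```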

Cited by 1 publication (1 citation statement)
References: 34 publications
“…Ding et al (2019) and Gowda and May (2020) examine the effect of BPE vocabulary size, and Bogoychev and Chen (2021) experiment with BPE trained on a different domain, which is therefore suboptimal for the primary one. Tokenization of the training data is well known to affect machine translation and other NLP model performance (Domingo et al, 2023; Toraman et al, 2023; Zouhar et al, 2023).…”
Section: arXiv:2401.16055v1 [cs.CL] 29 Jan 2024, 2 Related Work
confidence: 99%