2022
DOI: 10.1162/tacl_a_00461

ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Abstract: Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has of…

Cited by 133 publications (125 citation statements)
References 31 publications

Correcting diacritics and typos with a ByT5 transformer model

Stankevičius, Lukoševičius, Kapočiūtė-Dzikienė et al. 2022 (Preprint)
“…So the code point 353 of the letter "š" is translated into the two bytes 197 and 161, while the letter "s" retains the single byte 115. [8] showed better results using the transformer model ByT5 on these byte-level tokens rather than on characters. Inspired by their success on transliteration and noisy text tasks, we also use the same byte-level tokenization.…”
Section: Tokens
Confidence: 93%
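The byte values quoted in this statement follow directly from UTF-8 encoding; the following minimal Python snippet (standard library only, not code from the cited paper) checks them:

```python
# Verify the byte values quoted above: UTF-8 encodes "š" (code point 353)
# as two bytes, while the ASCII letter "s" remains a single byte.
for ch in ("š", "s"):
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: code point {ord(ch)}, bytes {list(encoded)}")

# Expected output:
# 'š': code point 353, bytes [197, 161]
# 's': code point 115, bytes [115]
```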
“…Three years later, there are now plenty of similarly pre-trained, publicly available models (e.g., in the HuggingFace transformers library [94]). We also build our work on top of one such pre-trained ByT5 [8] model.…”
Section: Transformer Models
Confidence: 99%
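For context, a pre-trained ByT5 checkpoint can be loaded through the HuggingFace transformers library as described in this statement. The sketch below is illustrative only: the checkpoint name (google/byt5-small), input text, and generation settings are assumptions, not details taken from the cited paper, which fine-tunes the model for diacritics and typo correction.

```python
# Minimal sketch: load a publicly available pre-trained ByT5 checkpoint and
# run it on raw text. Because the tokenizer is byte-level, the input IDs are
# (roughly) the UTF-8 byte values plus a small special-token offset, so no
# language-specific vocabulary or preprocessing is needed.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

text = "example input text"  # placeholder; the cited work feeds noisy text to a fine-tuned model
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```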