2022
DOI: 10.1162/tacl_a_00448

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences…
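The abstract's core idea, consuming raw character sequences with no fixed vocabulary, can be illustrated with a small sketch. The snippet below embeds Unicode codepoints by hashing them into several small tables and concatenating the resulting slices, loosely in the spirit of the hash-embedding approach described for Canine; the class name, table sizes, hash multipliers, and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: embedding raw text without a subword vocabulary.
# Every Unicode codepoint is hashed into several small embedding tables,
# and the resulting slices are concatenated into one character vector.
import torch
import torch.nn as nn


class HashedCodepointEmbedding(nn.Module):
    def __init__(self, dim: int = 768, num_hashes: int = 8, bucket_size: int = 16384):
        super().__init__()
        assert dim % num_hashes == 0
        self.bucket_size = bucket_size
        self.slice_dim = dim // num_hashes
        # One small table per hash function; each contributes one slice of the vector.
        self.tables = nn.ModuleList(
            [nn.Embedding(bucket_size, self.slice_dim) for _ in range(num_hashes)]
        )
        # Arbitrary odd multipliers acting as cheap hash functions (an assumption).
        self.multipliers = [2654435761 + 2 * k for k in range(num_hashes)]

    def forward(self, text: str) -> torch.Tensor:
        codepoints = torch.tensor([ord(c) for c in text])          # (seq_len,)
        slices = []
        for mult, table in zip(self.multipliers, self.tables):
            buckets = (codepoints * mult) % self.bucket_size       # hash codepoint -> bucket id
            slices.append(table(buckets))                          # (seq_len, slice_dim)
        return torch.cat(slices, dim=-1)                           # (seq_len, dim)


embed = HashedCodepointEmbedding()
print(embed("Canine needs no tokenizer").shape)  # torch.Size([25, 768])
```

Because any codepoint hashes to some bucket in every table, the embedding never hits an out-of-vocabulary case, which is the property the abstract contrasts with fixed subword lexicons.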

Cited by 75 publications (77 citation statements) | References 54 publications

“…BERT, for example, has been proven sensitive to (non-adversarial) human noise (Sun et al., 2020; Kumar et al., 2020). Examples of models that can be more resilient to noise include typological language models (Gerz et al., 2018; Ponti et al., 2019), sub-word or character-level language models (Kim et al., 2016; Zhu et al., 2019; Ma et al., 2020), byte-pair encoding (Sennrich et al., 2016), and their extension in recent tokenization-free models (Heinzerling and Strube, 2018; Clark et al., 2021; Xue et al., 2021), yet their use as noise-resilient language models remains to be fully assessed.…”
Section: Related Work
confidence: 99%
“…Second, following Banar et al. (2020), we use the convolutional character processing layers proposed by Lee et al. (2017). Third, we replace the convolutions with local self-attention as proposed in the CANINE model (Clark et al., 2021). Finally, we use the recently proposed Charformer architecture (Tay et al., 2021).…”
Section: Evaluated Models
confidence: 99%
“…CANINE. Clark et al. (2021) experiment with character-level pre-trained sentence representations. The character-processing architecture is in principle similar to Lee et al. (2017) but uses more modern building blocks.…”
Section: Evaluated Models
confidence: 99%
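The two statements above contrast different character-processing strategies: strided convolution over character embeddings (in the spirit of Lee et al., 2017) versus local self-attention followed by strided downsampling (roughly as in CANINE). Below is a minimal sketch of both downsampling steps; the dimensions, window size, stride, and module names are illustrative assumptions, not the evaluated configurations.

```python
# Hedged sketch: two ways to shrink a long character sequence before a deep encoder.
import torch
import torch.nn as nn


class ConvDownsampler(nn.Module):
    """Strided 1-D convolution over character embeddings (convolutional
    character processing, loosely in the spirit of Lee et al., 2017)."""

    def __init__(self, dim: int = 256, rate: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=rate, stride=rate)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:       # (batch, seq, dim)
        return self.conv(chars.transpose(1, 2)).transpose(1, 2)   # (batch, seq // rate, dim)


class LocalAttentionDownsampler(nn.Module):
    """Windowed (local) self-attention, then strided selection, roughly in the
    spirit of CANINE's character encoder; not the paper's exact architecture."""

    def __init__(self, dim: int = 256, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.window = window

    def forward(self, chars: torch.Tensor) -> torch.Tensor:       # (batch, seq, dim)
        b, s, d = chars.shape                                      # assumes s % window == 0
        blocks = chars.reshape(b * (s // self.window), self.window, d)
        mixed, _ = self.attn(blocks, blocks, blocks)               # attention only inside each window
        mixed = mixed.reshape(b, s, d)
        return mixed[:, :: self.window]                            # keep one position per window


x = torch.randn(2, 32, 256)                  # 32 character positions
print(ConvDownsampler()(x).shape)            # torch.Size([2, 8, 256])
print(LocalAttentionDownsampler()(x).shape)  # torch.Size([2, 8, 256])
```

Both variants reduce the character sequence by the same factor; the difference the citing paper evaluates is whether the mixing before downsampling is convolutional or attention-based.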
“…On the other hand, many adaptation techniques have focused on improving representation of the target language by modifying the model's vocabulary or tokenization schemes (Chung et al., 2020; Clark et al., 2021; Wang et al., 2021). This is well-motivated: Artetxe et al. (2020) emphasize representation in the vocabulary as a key factor for effective cross-lingual transfer, while Rust et al. (2020) find that mBERT's tokenization scheme for many languages is subpar.…”
Section: Related Work
confidence: 99%