2022
DOI: 10.1162/tacl_a_00448

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences…
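The abstract's core idea, consuming raw character sequences with no fixed vocabulary, can be illustrated with a small sketch. The snippet below embeds Unicode codepoints by hashing them into several small tables and concatenating the resulting slices, loosely in the spirit of the hash-embedding approach described for Canine; the class name, table sizes, hash multipliers, and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: embedding raw text without a subword vocabulary.
# Every Unicode codepoint is hashed into several small embedding tables,
# and the resulting slices are concatenated into one character vector.
import torch
import torch.nn as nn


class HashedCodepointEmbedding(nn.Module):
    def __init__(self, dim: int = 768, num_hashes: int = 8, bucket_size: int = 16384):
        super().__init__()
        assert dim % num_hashes == 0
        self.bucket_size = bucket_size
        self.slice_dim = dim // num_hashes
        # One small table per hash function; each contributes one slice of the vector.
        self.tables = nn.ModuleList(
            [nn.Embedding(bucket_size, self.slice_dim) for _ in range(num_hashes)]
        )
        # Arbitrary odd multipliers acting as cheap hash functions (an assumption).
        self.multipliers = [2654435761 + 2 * k for k in range(num_hashes)]

    def forward(self, text: str) -> torch.Tensor:
        codepoints = torch.tensor([ord(c) for c in text])          # (seq_len,)
        slices = []
        for mult, table in zip(self.multipliers, self.tables):
            buckets = (codepoints * mult) % self.bucket_size       # hash codepoint -> bucket id
            slices.append(table(buckets))                          # (seq_len, slice_dim)
        return torch.cat(slices, dim=-1)                           # (seq_len, dim)


embed = HashedCodepointEmbedding()
print(embed("Canine needs no tokenizer").shape)  # torch.Size([25, 768])
```

Because any codepoint hashes to some bucket in every table, the embedding never hits an out-of-vocabulary case, which is the property the abstract contrasts with fixed subword lexicons.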

Cited by 75 publications (77 citation statements) | References 54 publications

“…BERT, for example, has been proven sensitive to (non-adversarial) human noise (Sun et al., 2020; Kumar et al., 2020). Examples of models that can be more resilient to noise include typological language models (Gerz et al., 2018; Ponti et al., 2019), sub-word or character-level language models (Kim et al., 2016; Zhu et al., 2019; Ma et al., 2020), byte-pair encoding (Sennrich et al., 2016), and their extension in recent tokenization-free models (Heinzerling and Strube, 2018; Clark et al., 2021; Xue et al., 2021), yet their use as noise-resilient language models remains to be fully assessed.…”
Section: Related Work
confidence: 99%
“…Second, following Banar et al. (2020), we use the convolutional character processing layers proposed by Lee et al. (2017). Third, we replace the convolutions with local self-attention as proposed in the CANINE model (Clark et al., 2021). Finally, we use the recently proposed Charformer architecture (Tay et al., 2021).…”
Section: Evaluated Models
confidence: 99%
“…CANINE. Clark et al. (2021) experiment with character-level pre-trained sentence representations. The character-processing architecture is in principle similar to Lee et al. (2017) but uses more modern building blocks.…”
Section: Evaluated Models
confidence: 99%
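The two statements above contrast different character-processing strategies: strided convolution over character embeddings (in the spirit of Lee et al., 2017) versus local self-attention followed by strided downsampling (roughly as in CANINE). Below is a minimal sketch of both downsampling steps; the dimensions, window size, stride, and module names are illustrative assumptions, not the evaluated configurations.

```python
# Hedged sketch: two ways to shrink a long character sequence before a deep encoder.
import torch
import torch.nn as nn


class ConvDownsampler(nn.Module):
    """Strided 1-D convolution over character embeddings (convolutional
    character processing, loosely in the spirit of Lee et al., 2017)."""

    def __init__(self, dim: int = 256, rate: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=rate, stride=rate)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:       # (batch, seq, dim)
        return self.conv(chars.transpose(1, 2)).transpose(1, 2)   # (batch, seq // rate, dim)


class LocalAttentionDownsampler(nn.Module):
    """Windowed (local) self-attention, then strided selection, roughly in the
    spirit of CANINE's character encoder; not the paper's exact architecture."""

    def __init__(self, dim: int = 256, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.window = window

    def forward(self, chars: torch.Tensor) -> torch.Tensor:       # (batch, seq, dim)
        b, s, d = chars.shape                                      # assumes s % window == 0
        blocks = chars.reshape(b * (s // self.window), self.window, d)
        mixed, _ = self.attn(blocks, blocks, blocks)               # attention only inside each window
        mixed = mixed.reshape(b, s, d)
        return mixed[:, :: self.window]                            # keep one position per window


x = torch.randn(2, 32, 256)                  # 32 character positions
print(ConvDownsampler()(x).shape)            # torch.Size([2, 8, 256])
print(LocalAttentionDownsampler()(x).shape)  # torch.Size([2, 8, 256])
```

Both variants reduce the character sequence by the same factor; the difference the citing paper evaluates is whether the mixing before downsampling is convolutional or attention-based.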
“…On the other hand, many adaptation techniques have focused on improving representation of the target language by modifying the model's vocabulary or tokenization schemes (Chung et al., 2020; Clark et al., 2021; Wang et al., 2021). This is well-motivated: Artetxe et al. (2020) emphasize representation in the vocabulary as a key factor for effective cross-lingual transfer, while Rust et al. (2020) find that mBERT's tokenization scheme for many languages is subpar.…”
Section: Related Work
confidence: 99%