2021 · Preprint
DOI: 10.48550/arxiv.2103.06874

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Jonathan H. Clark,
Dan Garrette,
Iulia Turc
et al.

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences…
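To make the abstract's claim concrete, here is a minimal sketch of tokenization-free input. It assumes the Hugging Face transformers port of CANINE and the public google/canine-s checkpoint; neither is referenced on this page, so treat the names as illustrative.

import torch
from transformers import CanineModel, CanineTokenizer

# The CANINE "tokenizer" performs no subword lookup: it simply maps each
# character to its Unicode code point and adds a few special code points.
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

inputs = tokenizer("CANINE reads raw characters, not subword pieces.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"][0][:8])        # raw Unicode code points plus special ids
print(outputs.last_hidden_state.shape)   # one contextual vector per character position

Because there is no fixed subword vocabulary, the same pipeline applies unchanged to any language or domain covered by Unicode.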

Cited by 7 publications (13 citation statements) · References 45 publications

“…One application of Perceiver IO is byte-level language processing, which has concurrently been addressed by several other groups. [14] trains models on Unicode code points and shows results competitive with subword-based models on a multilingual question answering dataset. [83] trains on UTF-8 bytes directly by introducing a hand-designed module that is trained end-to-end to perform subword tokenization and produces results on-par with and sometimes better than subword-based models.…”
Section: Related Work (mentioning)
confidence: 97%
“…Unlike recent language understanding models such as BERT [21] or XLNet [96], Perceiver IO scales effectively with the input length (the Perceiver's latent size does not depend on the input length). For a given FLOPs budget, this allows us to train a tokenizer-free language model that matches the performance of a baseline model trained with a SentencePiece tokenizer, hence removing the need for hand-crafted and potentially harmful tokenization schemes [6,14].…”
Section: Language (mentioning)
confidence: 99%
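The scaling argument in the statement above can be illustrated with a short PyTorch sketch (an illustrative assumption, not code from Perceiver IO): a fixed-size latent array cross-attends to the raw byte sequence, so quadratic self-attention is paid only over the latents.

import torch
import torch.nn as nn

num_latents, d_model, num_bytes = 256, 512, 8192        # illustrative sizes

latents = nn.Parameter(torch.randn(1, num_latents, d_model))
byte_embed = nn.Embedding(256, d_model)                  # one embedding per byte value
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
latent_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

byte_ids = torch.randint(0, 256, (1, num_bytes))         # raw UTF-8 bytes, no tokenizer
x = byte_embed(byte_ids)                                 # (1, num_bytes, d_model)

# Cross-attention cost grows only linearly with num_bytes ...
h, _ = cross_attn(latents, x, x)
# ... while quadratic self-attention runs over the fixed latent array alone.
h, _ = latent_attn(h, h, h)
print(h.shape)                                           # torch.Size([1, 256, 512])

Because the latent width is fixed, doubling the input length only doubles the cross-attention cost; the deep stack over the latents is unchanged, which is why the FLOPs budget does not depend on sequence length the way it does in BERT-style encoders.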
“…Other related architectures, like Luna (Ma et al., 2021), can in principle be applied to autoregressive modeling, but are in practice only viable in an encoder-decoder context. Several other architectures (Dai et al., 2020; Nawrot et al., 2021; Clark et al., 2021) reduce the processing requirements of Transformers by sequentially compressing the input with attention or convolution, but still rely on the use of several large, quadratic-complexity attention layers and exploit locality assumptions that limit their generality. This makes them good candidates for more efficient architectures when applied to input chunks of similar size as normal Transformers (i.e.…
Section: D.1 Efficient Architectures (mentioning)
confidence: 99%
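The "sequential compression" described in the statement above can be sketched with a strided convolution over character states (again an illustrative assumption, not code from any cited paper): a long character sequence is downsampled before it reaches the expensive attention layers.

import torch
import torch.nn as nn

d_model, rate = 256, 4
char_states = torch.randn(1, 2048, d_model)              # (batch, characters, hidden)

# A strided 1-D convolution merges every `rate` neighbouring character states
# into one position, shrinking the sequence the deep Transformer attends over.
downsample = nn.Conv1d(d_model, d_model, kernel_size=rate, stride=rate)
compressed = downsample(char_states.transpose(1, 2)).transpose(1, 2)
print(compressed.shape)                                   # torch.Size([1, 512, 256]): 4x fewer positions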
“…CANINE, Character Architecture with No tokenization In Neural Encoders, is a tokenizer-free pre-trained encoder model designed to overcome the shortcomings of subword tokenization schemes such as WordPiece and SentencePiece [48]. For example, a pre-trained model tied to a specific tokenization may not be well suited to specialized domains.…”
Section: CANINE (mentioning)
confidence: 99%
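The specialized-domain point can be made concrete with a small comparison (hypothetical checkpoints chosen for illustration, not taken from the cited paper): a fixed WordPiece vocabulary fragments an out-of-domain term, while CANINE's input is simply the term's characters.

from transformers import AutoTokenizer, CanineTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed subword baseline
canine = CanineTokenizer.from_pretrained("google/canine-s")      # assumed CANINE checkpoint

term = "dexmedetomidine"                 # a domain-specific drug name
print(wordpiece.tokenize(term))          # split into several subword fragments
print(canine(term)["input_ids"])         # one code point per character, plus special ids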