2021 · Preprint
DOI: 10.48550/arxiv.2103.06874

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Jonathan H. Clark,
Dan Garrette,
Iulia Turc
et al.

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences…
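To make the abstract's claim concrete, here is a minimal sketch of tokenization-free input. It assumes the Hugging Face transformers port of CANINE and the public google/canine-s checkpoint; neither is referenced on this page, so treat the names as illustrative.

import torch
from transformers import CanineModel, CanineTokenizer

# The CANINE "tokenizer" performs no subword lookup: it simply maps each
# character to its Unicode code point and adds a few special code points.
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

inputs = tokenizer("CANINE reads raw characters, not subword pieces.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"][0][:8])        # raw Unicode code points plus special ids
print(outputs.last_hidden_state.shape)   # one contextual vector per character position

Because there is no fixed subword vocabulary, the same pipeline applies unchanged to any language or domain covered by Unicode.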

Cited by 7 publications (13 citation statements) · References 45 publications

“…One application of Perceiver IO is byte-level language processing, which has concurrently been addressed by several other groups. [14] trains models on Unicode code points and shows results competitive with subword-based models on a multilingual question answering dataset. [83] trains on UTF-8 bytes directly by introducing a hand-designed module that is trained end-to-end to perform subword tokenization and produces results on-par with and sometimes better than subword-based models.…”
Section: Related Work (mentioning)
confidence: 97%
“…Unlike recent language understanding models such as BERT [21] or XLNet [96], Perceiver IO scales effectively with the input length (the Perceiver's latent size does not depend on the input length). For a given FLOPs budget, this allows us to train a tokenizer-free language model that matches the performance of a baseline model trained with a SentencePiece tokenizer, hence removing the need for hand-crafted and potentially harmful tokenization schemes [6,14].…”
Section: Language (mentioning)
confidence: 99%
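The scaling argument in the statement above can be illustrated with a short PyTorch sketch (an illustrative assumption, not code from Perceiver IO): a fixed-size latent array cross-attends to the raw byte sequence, so quadratic self-attention is paid only over the latents.

import torch
import torch.nn as nn

num_latents, d_model, num_bytes = 256, 512, 8192        # illustrative sizes

latents = nn.Parameter(torch.randn(1, num_latents, d_model))
byte_embed = nn.Embedding(256, d_model)                  # one embedding per byte value
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
latent_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

byte_ids = torch.randint(0, 256, (1, num_bytes))         # raw UTF-8 bytes, no tokenizer
x = byte_embed(byte_ids)                                 # (1, num_bytes, d_model)

# Cross-attention cost grows only linearly with num_bytes ...
h, _ = cross_attn(latents, x, x)
# ... while quadratic self-attention runs over the fixed latent array alone.
h, _ = latent_attn(h, h, h)
print(h.shape)                                           # torch.Size([1, 256, 512])

Because the latent width is fixed, doubling the input length only doubles the cross-attention cost; the deep stack over the latents is unchanged, which is why the FLOPs budget does not depend on sequence length the way it does in BERT-style encoders.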
“…Other related architectures, like Luna (Ma et al., 2021), can in principle be applied to autoregressive modeling, but are in practice only viable in an encoder-decoder context. Several other architectures (Dai et al., 2020; Nawrot et al., 2021; Clark et al., 2021) reduce the processing requirements of Transformers by sequentially compressing the input with attention or convolution, but still rely on the use of several large, quadratic-complexity attention layers and exploit locality assumptions that limit their generality. This makes them good candidates for more efficient architectures when applied to input chunks of similar size as normal Transformers (i.e.…
Section: D.1 Efficient Architectures (mentioning)
confidence: 99%
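The "sequential compression" described in the statement above can be sketched with a strided convolution over character states (again an illustrative assumption, not code from any cited paper): a long character sequence is downsampled before it reaches the expensive attention layers.

import torch
import torch.nn as nn

d_model, rate = 256, 4
char_states = torch.randn(1, 2048, d_model)              # (batch, characters, hidden)

# A strided 1-D convolution merges every `rate` neighbouring character states
# into one position, shrinking the sequence the deep Transformer attends over.
downsample = nn.Conv1d(d_model, d_model, kernel_size=rate, stride=rate)
compressed = downsample(char_states.transpose(1, 2)).transpose(1, 2)
print(compressed.shape)                                   # torch.Size([1, 512, 256]): 4x fewer positions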
“…CANINE, Character Architecture with No tokenization In Neural Encoders, is a tokenizer-free pre-trained encoder model designed to overcome the shortcomings of subword tokenization schemes such as WordPiece and SentencePiece [48]. For example, a pre-trained model tied to a specific tokenization may not be well suited to specialized domains.…”
Section: CANINE (mentioning)
confidence: 99%
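The specialized-domain point can be made concrete with a small comparison (hypothetical checkpoints chosen for illustration, not taken from the cited paper): a fixed WordPiece vocabulary fragments an out-of-domain term, while CANINE's input is simply the term's characters.

from transformers import AutoTokenizer, CanineTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed subword baseline
canine = CanineTokenizer.from_pretrained("google/canine-s")      # assumed CANINE checkpoint

term = "dexmedetomidine"                 # a domain-specific drug name
print(wordpiece.tokenize(term))          # split into several subword fragments
print(canine(term)["input_ids"])         # one code point per character, plus special ids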