2020
DOI: 10.48550/arxiv.2006.03555
Preprint

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Abstract: Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of t…

Cited by 18 publications (33 citation statements)
References 21 publications
“…𝑙 = 1, 𝑓₁ = exp, and thus guarantees unbiased and nonnegative approximation of dot-product attention. This approach is more stable than Choromanski et al. [18] and reports better approximation results.…”
Section: Feature Maps Linear Transformer (mentioning)
confidence: 68%
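The "unbiased and nonnegative" property quoted above can be illustrated with exp-based positive random features: for Gaussian projections w, E[exp(w·q − ‖q‖²/2) · exp(w·k − ‖k‖²/2)] = exp(q·k), so the feature dot product is a strictly positive, unbiased estimate of the softmax kernel. The sketch below is a minimal NumPy illustration of that idea, not code from the paper; the function name positive_random_features and all shapes are assumptions made for the example.

```python
import numpy as np

def positive_random_features(x, W):
    """Nonnegative random features for the softmax kernel (illustrative).

    phi(x) = exp(x W^T - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k): an unbiased, nonnegative
    estimate of the exponential (softmax) kernel.
    """
    m = W.shape[0]
    proj = x @ W.T                                   # (n, m) Gaussian projections
    norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - norm) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 256
W = rng.standard_normal((m, d))                      # rows drawn i.i.d. from N(0, I_d)

q = rng.standard_normal((1, d)) / np.sqrt(d)
k = rng.standard_normal((1, d)) / np.sqrt(d)

approx = (positive_random_features(q, W) @ positive_random_features(k, W).T).item()
exact = np.exp(q @ k.T).item()
print(f"approx={approx:.4f}  exact={exact:.4f}")     # the two values should be close
```

Averaging over more random features (larger m) tightens the estimate while keeping every term nonnegative.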
“…Performer [18,19] uses random feature maps that approximate the scoring function of the Transformer. The random feature maps take functions 𝑓₁, …, 𝑓ₗ : ℝ → ℝ and ℎ : ℝ^𝐷 → ℝ.…”
Section: Feature Maps Linear Transformer (mentioning)
confidence: 99%
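The quoted formulation, in which feature maps replace the softmax scoring function, is what makes attention scale linearly with sequence length: once queries and keys are mapped to features, the products can be regrouped as φ(Q)(φ(K)ᵀV) and computed right-to-left. The sketch below is a minimal NumPy illustration under that assumption; the feature map (elu + 1) and the function name feature_map_attention are illustrative choices, not the paper's.

```python
import numpy as np

def feature_map_attention(Q, K, V, phi):
    """Attention via feature maps: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), computed right-to-left so the cost is
    linear in sequence length n rather than quadratic."""
    Qf, Kf = phi(Q), phi(K)             # (n, m) feature-mapped queries and keys
    KV = Kf.T @ V                       # (m, d_v): fixed-size summary of keys/values
    Z = Qf @ Kf.sum(axis=0)             # (n,): per-query normalisation
    return (Qf @ KV) / Z[:, None]

# illustrative nonnegative feature map: elu(x) + 1
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = feature_map_attention(Q, K, V, phi)
print(out.shape)                        # (128, 16)
```

Because the (m, d_v) summary does not grow with n, memory and time stay linear in the sequence length.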
“…In the early days of neural networks, fixed random layers (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) were studied in reservoir computing (Maass et al., 2002; Jaeger, 2003; Lukoševičius and Jaeger, 2009), "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), and so on. Recently, random features have also been extensively explored for modern neural networks in deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017; Shen et al., 2021), random kernel features (Peng et al., 2021; Choromanski et al., 2020), and applications in text classification (Conneau et al., 2017; Wieting and Kiela, 2019), summarization (Pilault et al., 2020) and probing (Voita and Titov, 2020). Compressing Transformer.…”
Section: Related Work (mentioning)
confidence: 99%
“…The success of the Transformer has proven that compounding these SEMs results in a uniquely effective function approximator for even the most complex correlation functions, such as those that determine the structure of natural languages. However, there is also a growing body of evidence [9][10][11][12][13][14][15] that many of these computations are superfluous and that many state-of-the-art results can be reproduced with significantly fewer learnable parameters, making computations more efficient and generally leading to faster training and better-performing models. Optimizing the Transformer is an active field of research, and many of the most effective methods currently involve complicated rearrangements of traditional architectures. In a recent work [16], the authors presented a uniquely simplified variation on the standard autoencoding Transformer architecture, in which they substitute several self-attention sublayers with a computationally trivial procedure for mixing tokens using Fourier transform coefficients, thus benefiting from the machinery of FFT algorithms such as Cooley-Tukey.…”
Section: Introduction (mentioning)
confidence: 99%
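The Fourier-mixing idea described in this excerpt can be sketched in a few lines: replace the self-attention sublayer with a 2D DFT over the sequence and hidden dimensions and keep only the real part, so the FFT does the token mixing. The snippet below is a minimal NumPy illustration of that idea, not the implementation from [16]; the function name fourier_mixing and the tensor shapes are assumptions.

```python
import numpy as np

def fourier_mixing(x):
    """Token mixing without attention: apply a 2D DFT over the
    (sequence, hidden) axes and keep the real part, in the spirit of
    the Fourier-mixing sublayer described in [16]."""
    # np.fft.fft2 transforms the last two axes: (seq_len, hidden_dim)
    return np.real(np.fft.fft2(x))

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # (seq_len, hidden_dim) embeddings
mixed = fourier_mixing(tokens)
print(mixed.shape)                       # (64, 32), same shape as the input
```

The operation has no learnable parameters and runs in O(n log n) via Cooley-Tukey-style FFTs, which is the efficiency argument the excerpt refers to.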