2020
DOI: 10.48550/arxiv.2006.03555
Preprint

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Abstract: Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of t…

Cited by 18 publications (33 citation statements)
References 21 publications
“…𝑙 = 1, 𝑓₁ = exp, and thus guarantees unbiased and nonnegative approximation of dot-product attention. This approach is more stable than Choromanski et al. [18] and reports better approximation results.…”
Section: Feature Maps Linear Transformer (mentioning)
confidence: 68%
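The "unbiased and nonnegative" property quoted above can be illustrated with exp-based positive random features: for Gaussian projections w, E[exp(w·q − ‖q‖²/2) · exp(w·k − ‖k‖²/2)] = exp(q·k), so the feature dot product is a strictly positive, unbiased estimate of the softmax kernel. The sketch below is a minimal NumPy illustration of that idea, not code from the paper; the function name positive_random_features and all shapes are assumptions made for the example.

```python
import numpy as np

def positive_random_features(x, W):
    """Nonnegative random features for the softmax kernel (illustrative).

    phi(x) = exp(x W^T - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k): an unbiased, nonnegative
    estimate of the exponential (softmax) kernel.
    """
    m = W.shape[0]
    proj = x @ W.T                                   # (n, m) Gaussian projections
    norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - norm) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 256
W = rng.standard_normal((m, d))                      # rows drawn i.i.d. from N(0, I_d)

q = rng.standard_normal((1, d)) / np.sqrt(d)
k = rng.standard_normal((1, d)) / np.sqrt(d)

approx = (positive_random_features(q, W) @ positive_random_features(k, W).T).item()
exact = np.exp(q @ k.T).item()
print(f"approx={approx:.4f}  exact={exact:.4f}")     # the two values should be close
```

Averaging over more random features (larger m) tightens the estimate while keeping every term nonnegative.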
“…Performer [18,19] uses random feature maps that approximate the scoring function of the Transformer. The random feature maps take functions 𝑓₁, …, 𝑓ₗ : ℝ → ℝ and ℎ : ℝ^𝐷 → ℝ.…”
Section: Feature Maps Linear Transformer (mentioning)
confidence: 99%
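The quoted formulation, in which feature maps replace the softmax scoring function, is what makes attention scale linearly with sequence length: once queries and keys are mapped to features, the products can be regrouped as φ(Q)(φ(K)ᵀV) and computed right-to-left. The sketch below is a minimal NumPy illustration under that assumption; the feature map (elu + 1) and the function name feature_map_attention are illustrative choices, not the paper's.

```python
import numpy as np

def feature_map_attention(Q, K, V, phi):
    """Attention via feature maps: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), computed right-to-left so the cost is
    linear in sequence length n rather than quadratic."""
    Qf, Kf = phi(Q), phi(K)             # (n, m) feature-mapped queries and keys
    KV = Kf.T @ V                       # (m, d_v): fixed-size summary of keys/values
    Z = Qf @ Kf.sum(axis=0)             # (n,): per-query normalisation
    return (Qf @ KV) / Z[:, None]

# illustrative nonnegative feature map: elu(x) + 1
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = feature_map_attention(Q, K, V, phi)
print(out.shape)                        # (128, 16)
```

Because the (m, d_v) summary does not grow with n, memory and time stay linear in the sequence length.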
“…In the early days of neural networks, fixed random layers (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) were studied in reservoir computing (Maass et al., 2002; Jaeger, 2003; Lukoševičius and Jaeger, 2009), "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), and so on. Recently, random features have also been extensively explored for modern neural networks in deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017; Shen et al., 2021), random kernel features (Peng et al., 2021; Choromanski et al., 2020), and applications in text classification (Conneau et al., 2017; Wieting and Kiela, 2019), summarization (Pilault et al., 2020) and probing (Voita and Titov, 2020). Compressing Transformer.…”
Section: Related Work (mentioning)
confidence: 99%
“…The success of the Transformer has proven that compounding these SEMs results in a uniquely effective function approximator for even the most complex correlation functions, such as those that determine the structure of natural languages. However, there is also a growing body of evidence [9][10][11][12][13][14][15] that many of these computations are superfluous and that many state-of-the-art results can be reproduced with significantly fewer learnable parameters, making computations more efficient and generally leading to faster training and better-performing models. Optimizing the Transformer is an active field of research, and many of the most effective methods currently involve complicated rearrangements of traditional architectures. In a recent work [16], the authors presented a uniquely simplified variation on the standard autoencoding Transformer architecture, in which they substitute several self-attention sublayers with a computationally trivial procedure for mixing tokens using Fourier transform coefficients, thus benefiting from the machinery of FFT algorithms such as Cooley-Tukey.…”
Section: Introduction (mentioning)
confidence: 99%
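The Fourier-mixing idea described in this excerpt can be sketched in a few lines: replace the self-attention sublayer with a 2D DFT over the sequence and hidden dimensions and keep only the real part, so the FFT does the token mixing. The snippet below is a minimal NumPy illustration of that idea, not the implementation from [16]; the function name fourier_mixing and the tensor shapes are assumptions.

```python
import numpy as np

def fourier_mixing(x):
    """Token mixing without attention: apply a 2D DFT over the
    (sequence, hidden) axes and keep the real part, in the spirit of
    the Fourier-mixing sublayer described in [16]."""
    # np.fft.fft2 transforms the last two axes: (seq_len, hidden_dim)
    return np.real(np.fft.fft2(x))

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # (seq_len, hidden_dim) embeddings
mixed = fourier_mixing(tokens)
print(mixed.shape)                       # (64, 32), same shape as the input
```

The operation has no learnable parameters and runs in O(n log n) via Cooley-Tukey-style FFTs, which is the efficiency argument the excerpt refers to.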