2022
DOI: 10.48550/arxiv.2205.14135
Preprint

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention …
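As a rough illustration of the idea sketched in the abstract, the snippet below contrasts a naive attention that materializes the full N-by-N score matrix with a blocked variant that keeps running row-wise softmax statistics so only small score tiles are ever formed. This is a plain PyTorch sketch of the tiling arithmetic only, not the paper's fused CUDA kernel; the function names, block size, and tensor shapes are illustrative assumptions.

    import torch

    def naive_attention(q, k, v):
        # Materializes the full (N, N) score matrix: quadratic memory in N.
        scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ v

    def blocked_attention(q, k, v, block=128):
        # Processes K/V in tiles while keeping a running row-wise max and
        # normalizer (online softmax), so only (N, block) score tiles exist.
        # Python loop for illustration; it saves nothing without a fused kernel.
        scale = q.shape[-1] ** -0.5
        out = torch.zeros_like(q)
        row_max = torch.full((q.shape[0], 1), float("-inf"))
        row_sum = torch.zeros(q.shape[0], 1)
        for start in range(0, k.shape[0], block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = (q @ kb.T) * scale                      # (N, block) score tile
            new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - new_max)                  # tile weights under new max
            rescale = torch.exp(row_max - new_max)      # correct earlier partial sums
            row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
            out = out * rescale + p @ vb
            row_max = new_max
        return out / row_sum

    q, k, v = (torch.randn(256, 64) for _ in range(3))
    assert torch.allclose(naive_attention(q, k, v), blocked_attention(q, k, v), atol=1e-4)

The two functions compute the same result; the point of the blocked form is that, when fused into a single GPU kernel, the score tiles can live in fast on-chip memory instead of being written to and re-read from slower device memory.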

Cited by 32 publications (37 citation statements)
References 33 publications (58 reference statements)
“…These optimizations create trade-offs between memory consumption and speed that can be tuned differently for training and inference. They include advanced implementations of neural network attention mechanisms (Vaswani et al 2017) with favorable properties for unusually short and long sequences (Rabe and Staats 2021, Dao et al 2022), module refactoring for lower memory usage, optional approximations of certain computations that reduce the memory burden, and specialized low-level code customized for GPU hardware. For technical details see appendices F.1 and F.2.…”
Section: Results (mentioning)
confidence: 99%
“…FlashAttention: We incorporate FlashAttention (Dao et al 2022), an efficient fused attention implementation that tiles computation in order to reduce data movement between different levels of GPU memory, greatly improving peak memory usage and runtime in the process. We find it to be particularly effective for short sequences with 1,000 residues or less, on which it contributes to an OpenFold speedup of up to 15% despite only being compatible with a small number of the attention modules in the network.…”
Section: Appendix A Related Work (mentioning)
confidence: 99%
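For a rough sense of what materializing attention scores costs at the sequence lengths discussed above, the back-of-the-envelope estimate below sizes a single full score matrix; the head count and dtype are assumptions for illustration, not OpenFold's actual configuration.

    # Rough size of one materialized attention score matrix.
    # Head count and dtype are illustrative assumptions, not OpenFold settings.
    seq_len, num_heads, bytes_per_elem = 1_000, 8, 4   # float32
    score_bytes = seq_len * seq_len * num_heads * bytes_per_elem
    print(f"{score_bytes / 2**20:.0f} MiB per sequence")   # ~31 MiB, growing quadratically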
“…This greatly exceeds the input length of common transformers used in NLM. Efficient self-attention techniques can be used (Katharopoulos et al, 2020; Wang et al, 2020; Dao et al, 2022). Also, since the order of the genes is not sequential in scRNA-seq data, and the transformer computation is agnostic to the order, we can dynamically sample subsets of the input.…”
Section: Encoder and Gene Expression Modeling (mentioning)
confidence: 99%
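The order-agnostic subsampling described in the statement above can be sketched as follows; the helper name, token budget, and tensor layout are hypothetical and only illustrate drawing a random gene subset per step.

    import torch

    def sample_gene_subset(expr, n_tokens=2048):
        # expr: (num_genes,) expression values for one cell.
        # Because the transformer treats gene tokens as an unordered set,
        # a random subset can stand in for the full gene panel at each step.
        # Hypothetical helper; the cited model's actual sampling may differ.
        idx = torch.randperm(expr.shape[0])[:n_tokens]
        return idx, expr[idx]

    cell = torch.rand(20_000)                  # illustrative ~20k-gene profile
    gene_ids, values = sample_gene_subset(cell)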
“…Among these, PyTorch provides a standard implementation of MHA [28]; NVIDIA TensorRT provides fused MHA for short sequences whose lengths are smaller than 512 [29]. To scale fused MHA to long sequences, Stanford researchers propose FlashAttention [30], which assumes identical input shapes and assigns the workload of a whole attention unit to a single CTA. However, FlashAttention incurs significant wasted computation if input sequence lengths are variable.…”
Section: B Related Work on DL Acceleration (mentioning)
confidence: 99%
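To make the padding overhead concrete, the toy calculation below estimates the share of attention work spent on padding when variable-length sequences are padded to the batch maximum; the lengths are made up for illustration and this is not a measurement of any particular kernel.

    # Toy estimate of work wasted by padding variable-length sequences to the
    # batch maximum; lengths are illustrative, not kernel measurements.
    lengths = [128, 256, 384, 512]
    max_len = max(lengths)
    useful = sum(n * n for n in lengths)            # attention cost scales ~ n^2
    padded = len(lengths) * max_len * max_len
    print(f"wasted work: {1 - useful / padded:.0%}")   # ~53% for this batch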