Restricted Boltzmann Machines (RBMs), the building block for newly popular Deep Belief Networks (DBNs), are a promising new tool for machine learning practitioners. However, future research in applications of DBNs is hampered by the considerable computation that training requires. In this paper, we describe a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever-larger networks. Our design uses a highly efficient, fully pipelined architecture based on 16-bit arithmetic for performing RBM training on an FPGA. We show that 16-bit arithmetic precision is sufficient, and we consequently use embedded hardware multiply-and-add (MADD) units. We present performance results showing that a speedup of 25-30x can be achieved over an optimized software implementation on a high-end CPU.
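The abstract does not spell out the training algorithm, but RBMs are conventionally trained with contrastive divergence. The sketch below shows one CD-1 update step in NumPy, with float16 parameters standing in loosely for the design's 16-bit MADD arithmetic; the function name and argument layout are illustrative, not the paper's interface.

```python
import numpy as np

def cd1_step(v0, W, b_h, b_v, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    Illustrative NumPy sketch of the standard training step; the paper's
    FPGA pipeline realizes this with 16-bit hardware MADD units, which we
    only approximate here by keeping the parameters in float16.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x.astype(np.float32)))

    # Positive phase: hidden probabilities and samples given the data batch.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(np.float32)

    # Negative phase: one Gibbs step back to the visible layer and up again.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)

    # CD-1 gradient: data correlations minus one-step model correlations.
    n = len(v0)
    W += ((lr / n) * (v0.T @ p_h0 - p_v1.T @ p_h1)).astype(W.dtype)
    b_h += (lr * (p_h0 - p_h1).mean(axis=0)).astype(b_h.dtype)
    b_v += (lr * (v0 - p_v1).mean(axis=0)).astype(b_v.dtype)
    return W, b_h, b_v
```

Every quantity in this step is a dot product plus a bias, which is why a fully pipelined array of 16-bit multiply-and-add units maps onto it so naturally.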
Approximate computing is a promising design paradigm for crossing the CPU power wall, driven primarily by the potential to trade output quality for significant gains in performance, energy, and fault tolerance. Unfortunately, existing solutions have focused primarily on either new programming models or new hardware designs, leaving significant room between these two ends for software-based optimizations. To fill this void, additional efforts should target the compilation and runtime stages, which have a critical impact on controlling how the many approximate subcomputations interact to form a well-optimized application. This paper presents EMEURO, a neural-network (NN)-based emulation and acceleration platform. By restructuring algorithms to have the same data flow as a NN, EMEURO is able to achieve significant speedups across several domains with minimal design effort. EMEURO uses novel NN-based approximate computing techniques, including methods for efficiently searching the high-dimensional subroutine space and fine-grained control of error at runtime. EMEURO achieves maximum speedups of 7x-109x over the original algorithms with 0.1%-10% approximation error.
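As a toy illustration of the core idea (replace a subroutine with a small NN and guard it with a runtime error check), the sketch below trains a tiny NumPy MLP to stand in for an "expensive" kernel. Everything here is hypothetical: `train_surrogate`, the 0.05 error budget, and the fallback logic are made up for illustration and are not EMEURO's API or pipeline.

```python
import numpy as np

def train_surrogate(target_fn, x, hidden=16, epochs=3000, lr=0.05, seed=0):
    """Fit a tiny one-hidden-layer MLP to imitate `target_fn` on samples `x`.

    Hypothetical sketch of NN-based subroutine emulation; EMEURO's actual
    topology search and runtime error control are far more elaborate.
    """
    y = target_fn(x).reshape(len(x), 1)
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (x.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(x @ W1 + b1)            # forward pass
        err = (h @ W2 + b2) - y             # residual of the MSE loss
        dh = (err @ W2.T) * (1.0 - h**2)    # backprop through tanh
        W2 -= lr * h.T @ err / len(x); b2 -= lr * err.mean(axis=0)
        W1 -= lr * x.T @ dh / len(x); b1 -= lr * dh.mean(axis=0)
    return lambda q: np.tanh(q @ W1 + b1) @ W2 + b2

# Example: emulate an "expensive" kernel, with a runtime error check that
# falls back to the exact code when the approximation drifts too far.
expensive = lambda x: np.sin(x[:, :1]) * np.cos(x[:, 1:2])
x_train = np.random.default_rng(1).uniform(-2, 2, (2000, 2))
fast = train_surrogate(expensive, x_train)
x_test = np.random.default_rng(2).uniform(-2, 2, (200, 2))
approx = fast(x_test)
if np.max(np.abs(approx - expensive(x_test))) > 0.05:  # error budget exceeded
    approx = expensive(x_test)                         # exact fallback
```

The speedup comes from the surrogate's fixed, regular data flow (two small matrix products) replacing arbitrary control flow, while the error check captures, in miniature, the paper's fine-grained runtime error control.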
Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model [20] on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
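The idea of selective activation recomputation can be approximated with PyTorch's generic activation checkpointing: checkpoint only the piece whose saved activations are large relative to their recompute cost, and keep the rest in memory. The block below is a minimal sketch of that principle; it checkpoints the whole attention core for simplicity, whereas Megatron-LM's implementation is more surgical (recomputing only parts of attention) and pairs this with sequence parallelism.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Transformer block with selective recomputation (illustrative sketch).

    Only the attention core is checkpointed: its activations are large but
    cheap to recompute, so we trade their memory for a small recompute cost,
    while the MLP activations are stored as usual. This mirrors the idea of
    selective activation recomputation, not Megatron-LM's exact code.
    """
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )

    def _attn_core(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

    def forward(self, x):
        # Recompute only the attention activations during the backward pass.
        x = x + checkpoint(self._attn_core, self.norm1(x), use_reentrant=False)
        # MLP activations are kept in memory (no recomputation).
        x = x + self.mlp(self.norm2(x))
        return x
```

Full-layer checkpointing would rerun the MLP as well; restricting the checkpoint to the attention core is what cuts most of the recompute overhead while still releasing the bulkiest activations.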