2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00038

In-Register Parameter Caching for Dynamic Neural Nets with Virtual Persistent Processor Specialization

Cited by 16 publications (5 citation statements); references 48 publications.

“…• Hardware Accelerators: Data-flow execution models using GPUs [46,37], FPGAs [101,107,120,64,60] and ASICs [42,4,21,115] are more efficient choices for CNNs than traditional CPUs. Among these, FPGAs are more flexible compared to ASICs and more efficient than GPUs.…”
Section: Power-Efficient CNNs
confidence: 99%
“…GPUrdma [65] proposed a matrix-vector product persistent kernel holding a constant matrix in shared memory. Khorasani et al [67] use persistent threads to keep parameters in cache. Zhu et al [68] proposed a sparse persistent implementation of recurrent neural networks.…”
Section: Related Work
confidence: 99%
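
The persistent-kernel idea referenced in this statement is to keep a kernel resident on the GPU and hold the network parameters in on-chip storage (registers or shared memory), so they are loaded from global memory once and reused across many inputs. The sketch below is a minimal illustration of that idea, not the implementation from GPUrdma or Khorasani et al.; the kernel name persistent_mv, the matrix dimensions, and the fixed batch loop are assumptions made for the example.

```
// Minimal sketch (assumed example, not the cited kernels): a persistent
// matrix-vector kernel that caches its slice of the weight matrix in
// registers once, then serves a stream of input vectors without re-reading
// the weights from global memory.
#include <cuda_runtime.h>

constexpr int ROWS  = 128;   // output dimension (one thread per row)
constexpr int COLS  = 64;    // input dimension, small enough to live in registers
constexpr int BATCH = 1000;  // number of input vectors streamed through

__global__ void persistent_mv(const float* __restrict__ W,   // ROWS x COLS, row-major
                              const float* __restrict__ x,   // BATCH x COLS
                              float* __restrict__ y)         // BATCH x ROWS
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= ROWS) return;

    // Cache this thread's row of W in registers: a fixed-size, fully
    // unrolled local array is kept in the register file by the compiler.
    float w[COLS];
    #pragma unroll
    for (int c = 0; c < COLS; ++c)
        w[c] = W[row * COLS + c];

    // Persistent loop: reuse the cached weights for every input vector.
    for (int b = 0; b < BATCH; ++b) {
        float acc = 0.0f;
        #pragma unroll
        for (int c = 0; c < COLS; ++c)
            acc += w[c] * x[b * COLS + c];
        y[b * ROWS + row] = acc;
    }
}

int main() {
    float *W, *x, *y;
    cudaMalloc(&W, ROWS * COLS * sizeof(float));
    cudaMalloc(&x, BATCH * COLS * sizeof(float));
    cudaMalloc(&y, BATCH * ROWS * sizeof(float));
    // (initialization of W and x omitted for brevity)

    // A single small grid stays resident for the whole batch of inputs.
    persistent_mv<<<(ROWS + 127) / 128, 128>>>(W, x, y);
    cudaDeviceSynchronize();

    cudaFree(W); cudaFree(x); cudaFree(y);
    return 0;
}
```

The design choice the cited works share is that the cost of loading parameters on chip is paid once per kernel launch rather than once per input, which pays off when the same weights are applied to many inputs, as in recurrent or dynamic neural networks.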
“…Training [57]. Cross-layer approaches related to our work include high- [52] and low- [29] level code generation techniques, and also memory management [23] and memory partitioning techniques [25], [35], [53]. There have been some recent works on SIMD and in particular looking at AVX extensions.…”
Section: Verma et al. Present a Workload Characterization of MLPerf
confidence: 99%