2021 58th ACM/IEEE Design Automation Conference (DAC)
DOI: 10.1109/dac18074.2021.9586216

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

Cited by 99 publications (19 citation statements)
References 26 publications
“…Approach (1) consists of solutions like VeriGOOD-ML [3], which maps ML models described in the ONNX format to three substantially different architecture templates for different types of neural networks through the PolyMath compiler. GEMMINI [5] provides a parametrized systolic array generator in Chisel that connects to a RISC-V core; the GEMMINI toolchain then offloads operations from specific layers of ONNX models to the systolic array. TVM's VTA architecture [11] is a specialized co-processor for matrix multiplication, generated through HLS for FPGA; the TVM high-level framework can compile machine learning models into a stream of instructions for VTA.…”
Section: Related Work
confidence: 99%
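To make concrete what a "parametrized systolic array generator in Chisel" looks like, here is a minimal sketch: a weight-stationary processing element and a rows x cols grid of them, where the array dimensions and datapath width are Scala generator parameters, so elaborating with different arguments yields differently sized hardware. All module and parameter names (PE, SystolicArray, rows, cols, width) are our own illustrative choices, not Gemmini's actual source.

// Minimal, hypothetical Chisel sketch of a parametrized systolic array.
import chisel3._

// One weight-stationary processing element: multiplies the streamed-in
// activation by a preloaded weight and adds the partial sum arriving
// from the PE above, registering both outputs for systolic flow.
class PE(val width: Int) extends Module {
  val io = IO(new Bundle {
    val inAct   = Input(SInt(width.W))   // activation from the left neighbor
    val inPsum  = Input(SInt(width.W))   // partial sum from the neighbor above
    val loadW   = Input(Bool())          // pulse to latch a new weight
    val inW     = Input(SInt(width.W))   // weight value to preload
    val outAct  = Output(SInt(width.W))  // activation forwarded right
    val outPsum = Output(SInt(width.W))  // partial sum forwarded down
  })
  val weight = RegInit(0.S(width.W))
  when(io.loadW) { weight := io.inW }
  // Multiply-accumulate; a real design would widen the accumulator
  // instead of truncating back to `width` bits as done here.
  val mac = io.inPsum + io.inAct * weight
  io.outAct  := RegNext(io.inAct)
  io.outPsum := RegNext(mac(width - 1, 0).asSInt)
}

// A rows x cols grid of PEs. Because rows, cols, and width are ordinary
// Scala parameters, this single description generates a family of arrays.
class SystolicArray(val rows: Int, val cols: Int, val width: Int) extends Module {
  val io = IO(new Bundle {
    val acts  = Input(Vec(rows, SInt(width.W)))          // one activation per row
    val loadW = Input(Bool())
    val wIn   = Input(Vec(rows, Vec(cols, SInt(width.W))))
    val psums = Output(Vec(cols, SInt(width.W)))         // column outputs
  })
  val pes = Seq.fill(rows, cols)(Module(new PE(width)))
  for (r <- 0 until rows; c <- 0 until cols) {
    pes(r)(c).io.loadW := io.loadW
    pes(r)(c).io.inW   := io.wIn(r)(c)
    // Activations enter from the left edge and flow right;
    // partial sums enter as zero at the top edge and flow down.
    pes(r)(c).io.inAct  := (if (c == 0) io.acts(r) else pes(r)(c - 1).io.outAct)
    pes(r)(c).io.inPsum := (if (r == 0) 0.S else pes(r - 1)(c).io.outPsum)
  }
  for (c <- 0 until cols) io.psums(c) := pes(rows - 1)(c).io.outPsum
}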
“…Energy spent on the entire benchmark model inference is calculated from the average power, the performance in terms of the number of cycles required to run the benchmark on each design, and the design's clock frequency. Figure 14 shows a PPAE comparison of AI-PiM with the equivalent Gemmini accelerator (Genc et al. (2021); Gonzalez and Hong (2020)) with an 8 × 8 systolic array. The figure shows that the loosely coupled Gemmini systolic-array accelerator takes 9.62 times the power, 18.34 times the area, and 9.36 times the energy to offer just a 3% performance improvement over AI-PiM on the ResNet-50 neural network model.…”
Section: Figure 12
confidence: 99%
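The energy calculation described in the excerpt above reduces to multiplying average power by execution time, with time recovered from the cycle count and the clock frequency. In our notation (not the cited paper's):

E_{\mathrm{inference}} = P_{\mathrm{avg}} \cdot t_{\mathrm{exec}} = P_{\mathrm{avg}} \cdot \frac{N_{\mathrm{cycles}}}{f}

For example, a design averaging 50 mW over 2,000,000 cycles at 1 GHz runs for 2 ms and spends 50 mW × 2 ms = 100 µJ per inference.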
“…VeriGOOD-ML [22] uses the PolyMath compiler [23] to map ML models in the ONNX format to three different architecture templates designed for different types of neural networks. GEMMINI [24] offloads operations from specific layers of ONNX models to a systolic array connected to a RISC-V core, after building the systolic array itself starting from a parametrized generator in Chisel. TVM's VTA architecture [25] is a configurable FPGA co-processor for matrix multiplication; the TVM high-level framework then compiles each ML model into instructions for VTA.…”
Section: Hardware Acceleration for Machine Learning
confidence: 99%
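Both excerpts describe the same offload flow: a compiler walks the ONNX graph and decides, per layer, whether an operation runs on the accelerator or stays on the host RISC-V core. The following deliberately simplified Scala sketch shows that partitioning decision; the op names, the supported-op set, and all identifiers (Layer, OffloadPartitioner, acceleratedOps) are hypothetical illustrations, not the actual Gemmini or TVM toolchain API.

// Hypothetical sketch of per-layer offload partitioning: GEMM-like ops
// go to the accelerator, everything else stays on the host core.
sealed trait Target
case object Accelerator extends Target // e.g. the systolic array
case object HostCpu     extends Target // the RISC-V core

final case class Layer(name: String, op: String)

object OffloadPartitioner {
  // Ops we assume the systolic array can execute (illustrative set).
  private val acceleratedOps = Set("MatMul", "Gemm", "Conv")

  def assign(layer: Layer): Target =
    if (acceleratedOps.contains(layer.op)) Accelerator else HostCpu

  def main(args: Array[String]): Unit = {
    val model = Seq(
      Layer("conv1", "Conv"),
      Layer("relu1", "Relu"),
      Layer("fc1",   "Gemm")
    )
    // Prints: conv1 -> Accelerator, relu1 -> HostCpu, fc1 -> Accelerator
    model.foreach(l => println(s"${l.name} (${l.op}) -> ${assign(l)}"))
  }
}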