2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date51398.2021.9474230
Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Abstract: Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (CSR) and compressed sparse fiber (CSF) by accelerating the underlying operations of streaming indirection, intersection…
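
The two primitives the abstract names are worth making concrete. The C sketch below is not from the paper; it is a minimal software rendering, with hypothetical function names, of what the extension accelerates: streaming indirection (the gather inside a CSR sparse-times-dense product) and streaming intersection (the sorted-index merge behind a two-sided sparse-times-sparse product).

```c
#include <stddef.h>

/* Streaming indirection: the gather at the heart of a CSR sparse-times-dense
 * row product, acc = sum_j val[j] * x[col[j]]. A stream semantic register
 * extension would resolve the col[j] -> x[col[j]] indirection in hardware;
 * here it is spelled out in software for illustration. */
static double csr_row_dot_dense(const double *val, const size_t *col,
                                size_t nnz, const double *x) {
    double acc = 0.0;
    for (size_t j = 0; j < nnz; ++j)
        acc += val[j] * x[col[j]];   /* indirect load through the index stream */
    return acc;
}

/* Streaming intersection: two-sided sparsity (sparse-times-sparse dot product)
 * reduces to merging two sorted index streams and multiplying only where the
 * coordinates of the two operands match. */
static double sparse_sparse_dot(const double *aval, const size_t *aidx, size_t an,
                                const double *bval, const size_t *bidx, size_t bn) {
    double acc = 0.0;
    size_t i = 0, j = 0;
    while (i < an && j < bn) {
        if (aidx[i] == bidx[j])          /* both operands nonzero at this index */
            acc += aval[i++] * bval[j++];
        else if (aidx[i] < bidx[j])
            ++i;                          /* advance the stream that is behind */
        else
            ++j;
    }
    return acc;
}
```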

Cited by 7 publications (4 citation statements)
References 34 publications
“…One could expect new breakthroughs to enable higher sparsity closer to those in scientific computing (>99.9%). Then, another class of accelerators, such as SpArch, Indirection Stream Semantic Registers [Scheffler et al 2020], or ExTensor [Hegde et al 2019] would play a bigger role. 7.2.3 Overview of accelerators for sparse deep learning.…”
Section: Training Accelerators (mentioning)
confidence: 99%
“…In our experiments, the lower bandwidth was compensated by the reduction in the amount of processed data, and the Flare in-network sparse allreduce outperformed the SparCML host-based sparse allreduce. However, we believe that there is still space for improvement, either by optimizing the handlers code or by introducing hardware support to optimize indirect memory accesses [84].…”
Section: Discussion (mentioning)
confidence: 99%

Flare: Flexible In-Network Allreduce
De Sensi, Di Girolamo, Ashkboos et al. 2021 (Preprint, Self Cite)
“…These works mainly focus on proposing extensions to the instruction set architecture, such as new instructions and new core-side micro-architecture, where the main idea is to introduce memory streams in the ISA and decouple computation and memory access [7,21,39,40,25,41]. For example, Wang et al.…”
Section: Core-side Stream Extensions (mentioning)
confidence: 99%
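
As a rough illustration of the decoupling this quote describes, the sketch below models an indirection stream in plain C. The struct and stream_next() are hypothetical names used only for exposition, not the actual interface of any of the cited ISA extensions: the point is that the compute loop contains no address arithmetic, so hardware backing the stream could resolve the gathers ahead of the multiply-accumulates.

```c
#include <stddef.h>

/* Conceptual model of a core-side indirection stream: the "access" side walks
 * an index array and resolves one indirect load per step, while the "execute"
 * side only consumes ready values. In the ISA extensions discussed above this
 * state lives in a stream semantic register, not a C struct. */
typedef struct {
    const size_t *idx;    /* index stream (e.g., CSR column indices) */
    const double *base;   /* dense operand being gathered from */
    size_t pos;           /* current position in the index stream */
} ind_stream_t;

static inline double stream_next(ind_stream_t *s) {
    return s->base[s->idx[s->pos++]];   /* one indirection per element */
}

/* The compute loop holds no address arithmetic: reading the "register"
 * implicitly advances the stream, which is what lets hardware prefetch and
 * overlap the gathers with the arithmetic. */
static double dot_with_stream(const double *val, ind_stream_t *s, size_t nnz) {
    double acc = 0.0;
    for (size_t j = 0; j < nnz; ++j)
        acc += val[j] * stream_next(s);
    return acc;
}
```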