“…ARM's scalable vector extension (SVE) is a promising alternative for incorporating micro-architecture-independent vector instructions into general-purpose processors. Between a processor's L1 and L2 caches, there are already instances where the communication is 1024-bit wide, such as in the ARM-based A64FX supercomputer processor [10].…”
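The property the snippet alludes to is that SVE code is vector-length agnostic: the same binary runs on any implementation from 128 to 2048 bits, including the A64FX's 512-bit datapath. As a minimal illustration (my sketch, not taken from the cited work), the following saxpy kernel written with the Arm C Language Extensions queries the hardware vector width at run time instead of hard-coding it:

    #include <arm_sve.h>

    /* Vector-length-agnostic y[i] += a * x[i]: the hardware vector width
     * is queried at run time, so the same binary runs on any SVE
     * implementation. (Illustrative sketch, not from the cited paper.) */
    void saxpy(float a, const float *x, float *y, long n) {
        for (long i = 0; i < n; i += svcntw()) {   /* svcntw(): 32-bit lanes per vector */
            svbool_t pg = svwhilelt_b32(i, n);     /* predicate masks the loop tail */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            vy = svmla_f32_x(pg, vy, vx, svdup_f32(a)); /* vy += vx * a */
            svst1_f32(pg, y + i, vy);
        }
    }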
Big data analytics and machine learning workloads are increasingly offloaded to FPGAs due to their substantial compute capability and internal parallelism. Different programming models are used to distribute the workload across the FPGA fabric at different granularities. While memory bandwidth has been steadily increasing, systems-on-chip face challenges in using this bandwidth effectively. One way architects exploit the increasing memory bandwidth is by widening the datapath. This is reflected at various points in the system, including ever-wider vector instructions. On FPGAs, many analytics accelerators are memory-bound and would benefit from making the most of the available bandwidth. In this paper we present a scalable and highly efficient building block for high-throughput streaming accelerators, which removes sparsity on the fly without backpressure.
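To make the abstract's core idea concrete, here is a minimal software model of on-the-fly sparsity removal, i.e. stream compaction: each cycle a wide input word arrives together with a validity mask, and only the valid elements are packed into the dense output stream. The interface and names below are illustrative assumptions, not the paper's actual design.

    #include <stdint.h>
    #include <stddef.h>

    /* Software model of one cycle of on-the-fly sparsity removal:
     * pack the valid lanes of a wide input word contiguously into the
     * output stream. (Illustrative sketch; names and interface are
     * assumptions, not the paper's design.) */
    size_t compact_word(const uint32_t in[], uint32_t valid_mask,
                        size_t lanes, uint32_t out[]) {
        size_t n = 0;
        for (size_t i = 0; i < lanes; i++)
            if (valid_mask & (1u << i))
                out[n++] = in[i];   /* keep only non-sparse elements */
        return n;                   /* number of dense elements produced */
    }

In hardware, the sequential loop would typically be replaced by a prefix sum over the validity mask driving a shuffle network, so one input word can be consumed every cycle and the upstream producer never stalls, which is what operating "without backpressure" implies.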
“…As an answer to the end of Moore's Law [12] and Dennard scaling [7], computer architects must strive for improved scalability and energy efficiency to propel performance scaling in the post-Moore era [22]. This challenge has led to an architectural shift from exploiting high Instruction Level Parallelism (ILP) towards the exploitation of on-chip Multiple Instruction, Multiple Data (MIMD) parallelism [8].…”
While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors promises to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option for building small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyze Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W on the same kernel, more than twice the energy efficiency of the Snitch-based system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly coupled L1 memory.
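For reference, the kernel behind the 256 × 256 figure is a plain 32-bit integer matrix multiplication. The sketch below is a naive C reference (not the paper's optimized kernel) in which every inner iteration is one multiply-accumulate, the operation the 7.9 pJ figure refers to; a vector PE such as Spatz would execute the inner work as strip-mined vector MACs.

    #include <stdint.h>

    #define N 256   /* the abstract's 256 x 256 benchmark size */

    /* Reference 32-bit integer matrix multiply, C = A * B.
     * (Plain-C reference sketch, not the paper's optimized kernel.) */
    void matmul_i32(const int32_t A[N][N], const int32_t B[N][N],
                    int32_t C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int32_t acc = 0;
                for (int k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];   /* one MAC operation */
                C[i][j] = acc;
            }
    }

Counting each MAC as two operations (one multiply, one add; the usual GOPS convention, assumed here), this kernel performs 2 · 256³ ≈ 33.6 million operations, so at the reported 285 GOPS it would complete in roughly 0.12 ms.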