HBM Connect: High-Performance HLS Interconnect for FPGA HBM

Choi, Y.; Chi, Yuze; Qiao, Weikang; Samardzic, Nikola; Cong, Jason

doi:10.1145/3431920.3439301

Cited by 52 publications

(25 citation statements)

References 16 publications


“…Finally, their microbenchmark is developed in RTL, which still leaves a gap for software programmers who use HLS. More recently, Choi et al. further proposed HBM Connect [9], a fully customized HBM crossbar to better utilize HBM bandwidth when multiple PEs access multiple HBM banks, which is orthogonal to our work. Characterization of CPU-FPGA Communication.…”

Section: Related Work (mentioning)

confidence: 99%

Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Fang; Shannon (2022)

ACM Trans. Reconfigurable Technol. Syst.


Both modern datacenter and embedded FPGAs provide great opportunities for high-performance and high energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the clock frequency of the accelerator design, 2) the number of concurrent memory access ports, 3) the data width of each port, 4) the maximum burst access length for each port, and 5) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and embedded FPGA (Xilinx ZCU104) when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. Comparing between the typically used soft and hardened memory systems respectively found on datacenter and embedded FPGAs, we further summarize their unique features and discuss the effective approaches to leverage these systems. 
To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth, and achieve about 5.6x and 3.4x speedups over the 24-core CPU implementations.
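As a rough illustration of how the first three factors listed in the abstract (clock frequency, number of ports, and port width) bound the achievable bandwidth, the sketch below computes a theoretical upper limit. This simple model is an assumption made here for illustration and is not the paper's microbenchmark; it ignores burst length and access size, and the function name and example numbers are hypothetical.

```python
def peak_port_bandwidth_gbps(clock_mhz, num_ports, port_width_bits):
    """Upper bound in GB/s (1 GB = 1e9 bytes).

    Assumes every port transfers one full-width word per clock cycle,
    which real designs only approach with long, consecutive bursts.
    """
    bytes_per_cycle = num_ports * port_width_bits / 8
    return clock_mhz * 1e6 * bytes_per_cycle / 1e9


# Hypothetical example: a single 512-bit port clocked at 300 MHz.
print(peak_port_bandwidth_gbps(300, 1, 512))  # 19.2
```

Measured bandwidth falls below this bound when bursts are short or accesses are not consecutive, which is what the remaining two factors capture.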


“…To enable concurrent processing, we partition all the three major components internally. We use multi-stage switch networks [34] to improve the clock frequency without sacrificing the throughput [11] when all-to-all concurrent communication is required. Besides the three major components, the SPLAG accelerator also contains a dispatcher responsible for injecting the first active vertex, controlling program termination, and collecting statistics.…”

Section: The SPLAG Accelerator (mentioning)

confidence: 99%

Accelerating SSSP for Power-Law Graphs

Chi; Guo; Cong (2022)

Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite


The single-source shortest path (SSSP) problem is one of the most important and well-studied graph problems, widely used in application domains such as road navigation, neural image reconstruction, and social network analysis. Although various SSSP algorithms have been known for decades, implementing one efficiently for large-scale power-law graphs is still highly challenging today, because ① a work-efficient SSSP algorithm requires priority-order traversal of graph data, ② the priority queue needs to be scalable both in throughput and capacity, and ③ priority-order traversal requires extensive random memory accesses on graph data. In this paper, we present SPLAG to accelerate SSSP for power-law graphs on FPGAs. SPLAG uses a coarse-grained priority queue (CGPQ) to enable high-throughput priority-order graph traversal with a large frontier. To mitigate the high-volume random accesses, SPLAG employs a customized vertex cache (CVC) to reduce off-chip memory access and improve the throughput to read and update vertex data. Experimental results on various synthetic and real-world datasets show up to a 4.9× speedup over state-of-the-art SSSP accelerators, a 2.6× speedup over a 32-thread CPU running at 4.4 GHz, and a 0.9× speedup over an A100 GPU that has 4.1× power budget and 3.4× HBM bandwidth. Such high performance would place SPLAG in the 14th position of the Graph 500 benchmark for data-intensive applications (the highest using a single FPGA) with only a 45 W power budget. SPLAG is written in high-level synthesis C++ and is fully parameterized, which means it can be easily ported to FPGAs with different configurations. SPLAG is open-source at https://github.com/UCLA-VAST/splag.
CCS Concepts: Theory of computation → Shortest paths; Computer systems organization → Reconfigurable computing; High-level language architectures.


“…The bandwidth of each pseudo channel is 14.375 GB/s, for a total bandwidth of 460 GB/s. Because HBM is a new feature to FPGAs, existing studies of FPGA HBM mainly focus on tool development [17,18,37] and benchmarking [19,50], but very few applications. SpMM, a memory-intensive application which is distinguished from typical computation-intensive FPGA applications [5,23,31,77,87,88], is a good fit for HBM.…”

Section: High Bandwidth Memory (mentioning)

confidence: 99%
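The total in the statement above is the per-pseudo-channel bandwidth multiplied by the number of pseudo channels. A minimal sketch of that arithmetic follows; the channel count of 32 is not stated in the excerpt and is inferred here from 460 / 14.375, and the function name is hypothetical.

```python
# Figures taken from the quoted citation statement.
PER_CHANNEL_GBPS = 14.375
TOTAL_GBPS = 460.0


def total_bandwidth_gbps(num_channels, per_channel_gbps=PER_CHANNEL_GBPS):
    """Aggregate bandwidth when all pseudo channels are used concurrently."""
    return num_channels * per_channel_gbps


# Channel count implied by the two quoted figures (an inference, not a quote).
implied_channels = TOTAL_GBPS / PER_CHANNEL_GBPS
print(implied_channels)          # 32.0
print(total_bandwidth_gbps(32))  # 460.0
```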

“…C2 - The irregular column index shown as colored square numbers in Figure 1 (b) and (c) lead to irregular memory read requests, whereas the irregular row destination of PEs in Figure 1 (c) leads to irregular memory write requests. Although our accelerators are equipped with HBM which has higher memory bandwidth, the latency of accessing HBM is still high (up to 100 cycles) [18]. Inspired by the idea of caching random accessing on a higher memory hierarchy in graph processing [70,94], we partition the random memory read and write into a specific window, so random memory accessing is limited to on-chip fast memory.…”

Section: Motivation (mentioning)

confidence: 99%

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Song; Chi; Sohrabizadeh; et al. (2022)

Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite


Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM faces three challenges: (1) random memory accesses and unbalanced processing load caused by the random distribution of elements in sparse matrices, (2) inefficient data handling of large matrices that cannot fit on-chip, and (3) a non-general-purpose accelerator design where one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. The Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to large off-chip matrices, (3) PE-aware non-zero scheduling for balanced workload with an II=1 pipeline, and (4) hardware flexibility to enable prototyping the hardware once to support SpMMs of different sizes as a general-purpose accelerator. We leverage high bandwidth memory (HBM) for efficient access to both sparse and dense matrices. In the evaluation, we present an FPGA prototype Sextans which is executable on a Xilinx U280 HBM FPGA board and a projected prototype Sextans-P with higher bandwidth competitive to V100 and more frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over K80 GPU and Sextans-P achieves a 1.14x geomean speedup over V100 GPU (4.94x over K80). The code is available at https://github.com/linghaosong/Sextans.
