Compiled multithreaded data paths on FPGAs for dynamic workloads

Halstead, Robert J.; Najjar, Walid

doi:10.1109/cases.2013.6662507

Cited by 16 publications

(6 citation statements)

References 21 publications

“…7, the proposed accelerator can obtain higher performance for most of the test matrices, compared with the implementations on the Convey HC2ex platform with four Virtex-6 LX760 FPGAs [13], HC-1 [12] and Tesla S1070 [7]. With the number of the nonzero block in one block row and the density of one increasing, the performance improvement can be higher.…”

Section: Performance Comparison (mentioning)

confidence: 96%

“…K. Nagar, et al [12] implemented SpMV for large-scale sparse matrices on the Convey HC-1 with a novel streaming multiply-accumulator and local vector cache. Further, A hardware multithreaded implementation of SpMV on the Convey HC2ex, which makes use of multiple outstanding memory requests to mask the long latencies and multiple Computation Engines to process multiple rows in parallel [13]. However, the performance improvement of the above two implementations mainly depend on the high bandwidth and multiple memory controllers, which are greatly excessive of other platforms.…”

Section: Related Work (mentioning)

confidence: 99%

A deeply-pipelined FPGA-based SpMV accelerator with a hardware-friendly storage scheme

Guo, Dou, Lei, et al. 2015

IEICE Electron. Express

This paper presents a high-performance sparse matrix-vector multiplication (SpMV) accelerator on a field-programmable gate array (FPGA). By exploiting a hardware-friendly storage scheme, named Variable-Bit-Width Coordinate Block Quasi Compressed Sparse Row, the redundant computation and memory accesses can be reduced greatly through the nested block compression and variable-bit-width column-index encoding schemes. Based on the proposed compression scheme, a deeply-pipelined SpMV accelerator is implemented on a Xilinx Virtex XC7VX485T FPGA platform, which can handle sparse matrices with arbitrary size and sparsity pattern. Experimental results show that the proposed design can gain higher performance for most of the tested matrices and improve the utilization of the memory bandwidth up to 13×, compared with the previous works on the Convey platforms (HC-1 and HC-2ex) and Nvidia Tesla S1070 GPU platform.

“…Synthesis of multithreaded accelerators. Halstead and Najjar extend the ROCCC HLS compiler with the CHAT methodology to generate temporally multithreaded accelerators starting from loops constructs [23]. However, they do not address atomic memory operations and focus on the simple case study of pointer chasing.…”

Section: Related Work (mentioning)

confidence: 99%

Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics

Minutoli, Castellana, Saporetti, et al. 2022

IEEE Trans. Comput.

Graph analytics are an emerging class of irregular applications. Operating on very large datasets, they present unique behaviors, such as fine-grained, unpredictable memory accesses, and highly unbalanced task level parallelism, that make existing high-performance general-purpose processors or accelerators (e.g., GPUs) suboptimal. To address these issues, research and industry are developing a variety of custom accelerator designs for this application area, including solutions based on reconfigurable devices (Field Programmable Gate Arrays). These new approaches often employ High-Level Synthesis (HLS) to accelerate the development of the accelerators. In this paper, we propose a novel architecture template for the automatic generation of accelerators for graph analytics and irregular applications. The architecture template includes a dynamic task scheduling mechanism, a parallel array of accelerators that enables supporting task-level parallelism with context switching, and a related multi-channel memory interface that decouples communication from computation and provides support for fine-grained atomic memory operations. We discuss the integration of the architectural template in an HLS flow, presenting the necessary modifications to enable automatic generation of the custom architectures starting from OpenMP annotated code. We evaluate our approach first by synthesizing and exploring triangle counting, a common graph algorithm, and then by synthesizing custom designs for a set of graph database benchmark queries, representing series of graph pattern matching routines. We compare the synthesized accelerators with previous state-of-the-art methodologies for the synthesis of parallel architectures, showing that the proposed approach allows reducing resource usage by optimizing the number of accelerator replicas without any performance penalty.

“…Another idea is to maximize the utilization of a single hardware accelerator, extending its functionality to support hardware threads and hide latencies in pipelined loops. Halstead and Najjar extend the ROCCC HLS compiler to generate multi-threaded accelerators starting from loops constructs [22]. The programming model is similar to OpenMP for loops, and the generated architecture uses hardware context-switches to hide variable latencies due to memory accesses in irregular applications.…”

Section: High-level Synthesis Of Multi-threaded Programs (mentioning)

confidence: 99%

Automated Bug Detection for High-level Synthesis of Multi-threaded Irregular Applications

Fezzardi, Ferrandi 2020

ACM Trans. Parallel Comput.

Field Programmable Gate Arrays (FPGAs) are becoming an appealing technology in datacenters and High Performance Computing. High-Level Synthesis (HLS) of multi-threaded parallel programs is increasingly used to extract parallelism. Despite great leaps forward in HLS and related debugging methodologies, there is a lack of contributions in automated bug identification for HLS of multi-threaded programs. This work defines a methodology to automatically detect and isolate bugs in parallel circuits generated with HLS. The technique relies on hardware/software Discrepancy Analysis and exploits a pattern-matching algorithm based on Finite State Automata to compare multiple hardware and software threads. Overhead, advantages, and limitations are evaluated on designs generated with an open-source HLS compiler supporting OpenMP.
