Being "memory-centric", the single-chip distributed logic-memory (DLM) architecture can significantly improve the overall performance and energy efficiency of many memoryintensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis.We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmiccentric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.
General sparse matrix-matrix multiplication (SpGEMM) is an integral part of many scientific computing, high-performance computing (HPC), and graph analytic applications. This paper presents a new compressed sparse vector (CSV) format for representing sparse matrices and FSpGEMM, an OpenCL-based HPC framework for accelerating general sparse matrix-matrix multiplication on FPGAs. The proposed FSpGEMM framework includes an FPGA kernel implementing a throughput-optimized hardware architecture based on Gustavson's algorithm and a host program implementing pre-processing functions for converting input matrices to the CSV format tailored for the proposed architecture. FSpGEMM utilizes a new buffering scheme tailored to Gustavson's algorithm. We compare FSpGEMM implemented on an Intel Arria 10 GX FPGA development board with Intel Math Kernel Library (MKL) implemented on an Intel Xeon E5-2637 CPU and cuSPARSE on an NVIDIA GTX TITAN X GPU, respectively, for multiplying a set of sparse matrices selected from SuiteSparse Matrix Collection. The experiment results show that the proposed FSpGEMM solution achieves on average 4.9× and 1.7× higher performance with 31.9× and 13.1× lower energy consumption per SpGEMM computation than the CPU and GPU implementations, respectively.
In this paper, we present FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhancing host code portability in heterogeneous computing. FLASH takes a novel approach in describing kernels and dynamically dispatching them in a hardwareagnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which not only unify the compile-time control flow but also enforces a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) codes as well as the separation of hardware-specific and hardware-agnostic codes in the host application. We use static code analysis to measure the hardware independence ratio of popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure the complexity of state-of-the-art portable programming models and show that a code reduction of up to 2.2x can be achieved for two common HPC kernels while maintaining 100% code portability with a normalized framework overhead between 1% -13% of the total kernel runtime. The codes are available at https://github.com/PSCLab-ASU/FLASH.
CCS CONCEPTS• Software and its engineering → Object oriented frameworks; Software as a service orchestration system.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.