We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings, ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design are demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) the ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach using a compressed representation reduces the overall bandwidth while also achieving comparable efficiency relative to state-of-the-art approaches.
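For readers unfamiliar with the baseline formats named above, the following is a minimal software sketch of matrix-vector multiplication over CSR and COO encodings. It is not the paper's hardware decoder and does not implement CVBV, whose layout the abstract does not specify; the function names `spmv_csr` and `spmv_coo` and the example matrix are assumptions made for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative helper)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def spmv_coo(rows, cols, values, x, n_rows):
    """Compute y = A @ x for a matrix A stored as COO triples (illustrative helper)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, values):
        y[r] += v * x[c]
    return y


# Hypothetical 3x4 example matrix:
# [[5, 0, 0, 2],
#  [0, 0, 3, 0],
#  [0, 1, 0, 4]]
values = [5.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 3, 2, 1, 3]
row_ptr = [0, 2, 3, 5]          # CSR: one offset per row, plus an end marker
coo_rows = [0, 0, 1, 2, 2]      # COO: an explicit row index per nonzero
x = [1.0, 2.0, 3.0, 4.0]

y_csr = spmv_csr(row_ptr, col_idx, values, x)
y_coo = spmv_coo(coo_rows, col_idx, values, x, n_rows=3)
print(y_csr)  # [13.0, 9.0, 18.0]
```

Both encodings carry index metadata alongside each nonzero value; that metadata is the overhead a more compact representation would aim to shrink.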
We present an FPGA accelerator for the Nonuniform Fast Fourier Transform (NuFFT), a technique for reconstructing images from arbitrarily sampled data. We accelerate the compute-intensive interpolation step of the NuFFT gridding algorithm by implementing it on an FPGA. To ensure efficient memory performance, we present a novel FPGA implementation of geometric-tiling-based sorting of the arbitrary samples. The convolution is then performed by a novel Data Translation architecture, which is composed of a multi-port local memory, a dynamic coordinate generator, and a plug-and-play kernel pipeline. Our implementation uses single-precision floating point and has been ported onto the BEE3 platform. Experimental results show that our FPGA implementation delivers high performance without sacrificing flexibility across data sizes and kernel functions. We demonstrate up to 8x speedup and up to 27 times higher performance-per-watt over a comparable CPU implementation, and up to 20% higher performance-per-watt compared to a relevant GPU implementation.
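The gridding step described above can be sketched in one dimension as follows. This is a software illustration under stated assumptions, not the paper's architecture: the triangular kernel, the names `triangle_kernel` and `grid_samples`, and the `tile_size` parameter are all hypothetical, and the sort by tile index only imitates the idea of grouping nearby samples before convolution.

```python
import math


def triangle_kernel(distance, half_width):
    """Hypothetical interpolation kernel; any function with this signature plugs in."""
    return max(0.0, 1.0 - abs(distance) / half_width)


def grid_samples(coords, values, grid_size, half_width=1.0,
                 tile_size=4, kernel=triangle_kernel):
    """Spread nonuniform 1D samples onto a uniform grid by kernel convolution."""
    # Visit samples tile by tile so that consecutive updates touch nearby grid cells.
    order = sorted(range(len(coords)), key=lambda i: int(coords[i] // tile_size))
    grid = [0.0] * grid_size
    for i in order:
        u, v = coords[i], values[i]
        # Only grid points within the kernel support receive a contribution.
        lo = max(0, math.ceil(u - half_width))
        hi = min(grid_size - 1, math.floor(u + half_width))
        for g in range(lo, hi + 1):
            grid[g] += v * kernel(g - u, half_width)
    return grid


# A single unit sample at coordinate 2.25 is split between grid points 2 and 3.
example_grid = grid_samples([2.25], [1.0], grid_size=5)
print(example_grid)  # [0.0, 0.0, 0.75, 0.25, 0.0]
```

Passing a different `kernel` callable changes the interpolation function without touching the loop, which loosely mirrors the plug-and-play kernel idea in the abstract.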
Applications based on Discrete Fourier Transforms (DFT) are extensively used in several areas of signal and digital image processing. Of particular interest is the two-dimensional (2D) DFT, which is more computation- and bandwidth-intensive than the one-dimensional (1D) DFT. Traditionally, a 2D DFT is computed using Row-Column (RC) decomposition, where 1D DFTs are computed along the rows followed by 1D DFTs along the columns. Both application-specific and reconfigurable hardware have utilized this scheme for high-performance implementations of 2D DFT. However, architectures based on RC decomposition are not efficient for large input data sizes due to memory bandwidth constraints. In this paper, we propose an efficient architecture to implement 2D DFT for large-sized input data based on a novel 2D decomposition algorithm. This architecture achieves very high throughput by exploiting the inherent parallelism due to the algorithm decomposition and by utilizing the row-wise burst access pattern of the external memory. A high-throughput memory interface has been designed to enable maximum utilization of the memory bandwidth. In addition, an automatic system generator is provided for mapping this architecture onto a reconfigurable platform of Xilinx Virtex-5 devices. For a 2K × 2K input size, the proposed architecture is 1.96 times faster than an RC decomposition based implementation under the same memory constraints, and also outperforms other existing implementations. This paper is an extension of our paper that appeared in SIPS '09. The added sections are: (1) Impact of large data size on conventional 2D DFT architecture (Section 2.2); (2) Detailed descriptions of the infrastructure components of the FPGA platform (Section 4.2); (3) Detailed description of the automatic 2D DFT system generator (Section 5); (4) Accuracy analysis of the 2D DFT (Section 6.4).
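The conventional Row-Column decomposition mentioned above can be sketched in a few lines. This illustrates only the baseline scheme, not the paper's novel 2D decomposition, which the abstract does not detail; the function names are assumptions, and a naive O(n^2) 1D DFT stands in for an FFT to keep the sketch self-contained.

```python
import cmath


def dft_1d(seq):
    """Naive 1D DFT, used here in place of an FFT for brevity."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dft_2d_row_column(matrix):
    """2D DFT via RC decomposition: 1D DFTs along rows, then along columns."""
    rows_done = [dft_1d(row) for row in matrix]
    # zip(*rows_done) walks the intermediate result column by column; in hardware
    # this strided access to external memory is what limits bandwidth for large inputs.
    cols_done = [dft_1d(list(col)) for col in zip(*rows_done)]
    # Transpose back so the result is indexed as [row frequency][column frequency].
    return [list(row) for row in zip(*cols_done)]


def dft_2d_direct(matrix):
    """Reference 2D DFT evaluated straight from the definition."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[sum(matrix[r][c]
                 * cmath.exp(-2j * cmath.pi * (k * r / n_rows + l * c / n_cols))
                 for r in range(n_rows) for c in range(n_cols))
             for l in range(n_cols)]
            for k in range(n_rows)]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
rc_result = dft_2d_row_column(image)
direct_result = dft_2d_direct(image)
# The two results agree up to floating-point rounding.
max_error = max(abs(a - b)
                for row_a, row_b in zip(rc_result, direct_result)
                for a, b in zip(row_a, row_b))
```

The column pass reads one element from each row of the intermediate result, which is the access pattern that works against row-wise burst transfers from external memory.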
Video and image content has begun to play a growing role in many applications, ranging from video games to autonomous self-driving vehicles. In this paper, we present accelerators for gist-based scene recognition, saliency-based attention, and HMAX-based object recognition that have multiple uses and are based on the current understanding of the vision systems found in the visual cortex of the mammalian brain. By integrating them into a two-level hierarchical system, we improve recognition accuracy and reduce computational time. Results of our accelerator prototype on a multi-FPGA system show real-time performance and high recognition accuracy with large speedups over existing CPU, GPU and FPGA implementations.