We present two designs (I and II) for IEEE 754 double-precision floating-point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes and sustain their peak performance except during an initial latency period. Through these designs, we demonstrate the trade-offs between local memory and bandwidth in an FPGA implementation and present an analysis of the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design II and 5.9 GB/s for design I. This compares favourably with both related art and general-purpose CPU implementations.
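As a software illustration of the rank-1 update scheme named above, the following minimal sketch accumulates C = AB as a sum of outer products of the columns of A with the rows of B. It is not the FPGA design itself; the function names `matmul_rank1` and `peak_gflops` are hypothetical, and the peak-rate helper assumes each PE completes one multiply and one add per cycle, an assumption that is consistent with the quoted 29.8 GFLOPS but is not stated in the abstract.

```python
def matmul_rank1(a, b):
    """Compute C = A * B as a sum of rank-1 (outer-product) updates.

    a is an m x k matrix and b is a k x n matrix, both given as lists of rows.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of A
        row = b[p]                         # p-th row of B
        # Rank-1 update: C += col * row^T
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


def peak_gflops(num_pes, freq_mhz, flops_per_pe_per_cycle=2):
    """Peak rate in GFLOPS, ASSUMING one multiply and one add per PE per cycle."""
    return num_pes * flops_per_pe_per_cycle * freq_mhz / 1000.0


if __name__ == "__main__":
    print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(peak_gflops(40, 373))
```

Under the stated assumption, 40 PEs at 373 MHz give 40 × 2 × 373 MHz ≈ 29.8 GFLOPS, matching the figure in the abstract.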
A number of computations, especially in the areas of error-control coding and matrix computations, have underlying data flow graphs that are balanced bipartite graphs derived from finite projective geometry (PG). Many of these applications of finite projective geometry are under active research, especially in coding theory. Almost all of them need large bipartite graphs whose nodes represent parallel computations. Reducing the amount of system and hardware resources used by a design is therefore an important engineering objective, since it lowers implementation cost. In this context, we present a scheme to reduce resource utilization when designing systems modeled using PG-based graphs. In such systems, the number of processing units equals the number of vertices, with each unit performing an atomic computation. We present a novel way of partitioning the vertex set, whose elements are assigned to the various atomic computations, into blocks. Each block of the partition is then assigned to a processing unit. A processing unit performs the computations corresponding to the vertices in its block sequentially, thus creating the effect of folding the overall computation. The symmetry properties of projective space lattices enable us to develop a conflict-free communication schedule. We use coset decomposition of a finite field for the partitioning. The folding scheme achieves the best possible throughput, because scheduling another computation on the same processing unit incurs no overhead from shuffling data across memories. We first provide a scheme, with the corresponding schedules, for a finite projective space of dimension five. This specific scheme is then generalized to arbitrary finite projective spaces. Both folding schemes have been verified by simulation as well as by hardware prototyping.
For example, a semi-parallel decoder architecture for a new class of expander codes was designed and implemented using this scheme, with potential deployment in DVD-R/CD-ROM drives.
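To make the notion of a PG-based balanced bipartite graph concrete, the sketch below builds the point-line incidence graph of the projective plane PG(2, p) for a prime p. This is a deliberately small stand-in: the abstract works with a projective space of dimension five and partitions vertices by coset decomposition of a finite field, neither of which is reproduced here. The function names and the homogeneous-coordinate construction are assumptions made for illustration only.

```python
from itertools import product


def projective_points(p):
    """Canonical representatives of the points of PG(2, p), for prime p.

    Each point is a nonzero vector in F_p^3 taken up to scalar multiples;
    the representative chosen has its first nonzero coordinate equal to 1.
    """
    points = []
    for v in product(range(p), repeat=3):
        if not any(v):
            continue  # skip the zero vector
        first_nonzero = next(x for x in v if x)
        if first_nonzero == 1:
            points.append(v)
    return points


def incidence_graph(p):
    """Return (points, lines, edges) of the point-line incidence graph.

    By duality, lines are described by the same coordinate triples as points.
    A point lies on a line when their dot product is 0 modulo p.
    """
    points = projective_points(p)
    lines = projective_points(p)
    edges = {
        pt: [ln for ln in lines
             if sum(a * b for a, b in zip(pt, ln)) % p == 0]
        for pt in points
    }
    return points, lines, edges


if __name__ == "__main__":
    points, lines, edges = incidence_graph(2)
    print(len(points), len(lines))               # both sides have equal size
    print({len(nbrs) for nbrs in edges.values()})  # set of vertex degrees
```

For a prime p, both sides of this graph have p² + p + 1 vertices and every point is incident to p + 1 lines, which is the balance and regularity that a folding scheme can exploit when assigning blocks of vertices to processing units.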