The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors. On-chip parallel computation shows great promise for scaling raw processing performance within a given power budget. However, chip multiprocessors (CMPs) often struggle with programmability and scalability issues such as cache coherency and off-chip memory bandwidth and latency.
Abstract. Previous vector architectures divided the available register file space into a fixed number of registers of equal size and shape. We propose a register file organization, referred to as a Polymorphic Register File, that allows dynamic creation of a variable number of multidimensional registers of arbitrary sizes. Our objective is to evaluate the performance benefits of the proposed organization. Simulation results using real applications (Floyd and CG) suggest speedups of up to 3 times compared to the Cell SPU for Floyd and 2 times compared to a one-dimensional vectorized version of the sparse matrix vector multiplication. Moreover, in the same experimental context, the number of executed instructions is reduced by up to 3000 times for Floyd and 2000 times for sparse matrix vector multiplication.
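To make the idea of run-time defined multidimensional registers concrete, the following is a minimal sketch of a toy software model. It is not the organization proposed in the paper: the class name `ToyPolymorphicRegisterFile`, the register table layout, and the `define`/`read`/`write` methods are all hypothetical names chosen for illustration. It only shows how a fixed 2D storage area might be carved into a variable number of rectangular registers of different shapes.

```python
from dataclasses import dataclass


@dataclass
class LogicalRegister:
    """One entry of a hypothetical register table: a 2D block of the storage."""
    row: int     # top-left row of the block
    col: int     # top-left column of the block
    height: int  # number of rows in the block
    width: int   # number of columns in the block


class ToyPolymorphicRegisterFile:
    """Toy model (an assumption, not the paper's design): a fixed 2D storage
    area that is divided into rectangular registers at run time."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.storage = [[0] * cols for _ in range(rows)]
        self.table = {}  # register name -> LogicalRegister

    def define(self, name, row, col, height, width):
        """Create or reshape a register; reject blocks that do not fit
        or that overlap another defined register."""
        if row < 0 or col < 0 or height <= 0 or width <= 0:
            raise ValueError("invalid register geometry")
        if row + height > self.rows or col + width > self.cols:
            raise ValueError("register does not fit in the storage")
        for other_name, o in self.table.items():
            if other_name == name:
                continue  # redefining a register may reuse its own space
            disjoint = (row + height <= o.row or o.row + o.height <= row or
                        col + width <= o.col or o.col + o.width <= col)
            if not disjoint:
                raise ValueError(f"register overlaps {other_name}")
        self.table[name] = LogicalRegister(row, col, height, width)

    def write(self, name, values):
        """Store a height x width list of lists into the named register."""
        r = self.table[name]
        if len(values) != r.height or any(len(v) != r.width for v in values):
            raise ValueError("shape mismatch")
        for di, line in enumerate(values):
            self.storage[r.row + di][r.col:r.col + r.width] = list(line)

    def read(self, name):
        """Return the contents of the named register as a list of rows."""
        r = self.table[name]
        return [self.storage[r.row + di][r.col:r.col + r.width]
                for di in range(r.height)]
```

In this model, a 2 × 4 matrix register and a 1 × 8 vector register can coexist in the same storage, and either can later be redefined with a different shape, which is the property the abstract contrasts with fixed-size vector registers.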
Abstract. We study the scalability of multi-lane 2D Polymorphic Register Files (PRFs) in terms of clock cycle time, chip area and power consumption. We assume an implementation which stores data in a 2D array of linearly addressable memory banks, and consider one single-view and four suitable multi-view parallel access schemes which cover all basic access patterns commonly used in scientific and multimedia applications. The PRF design features 2 read ports and 1 write port, targeting the TSMC 90nm ASIC technology. We consider three PRF sizes (32KB, 128KB and 512KB) and four multi-lane configurations (8, 16, 32 and 64 lanes). Synthesis results suggest that the clock frequency varies between 500MHz for a 512KB PRF with 64 vector lanes and 970MHz for the 32KB, 8-lane case. Estimated power consumption ranges from less than 300mW (dynamic) and 10mW (leakage) for our 8-lane, 32KB PRF up to 8.7W (dynamic) and 276mW (leakage) for a 512KB PRF with 64 lanes. We also show the correlation among the storage capacity, the number of lanes, and the overall chip area. Furthermore, we investigated customized addressing functions. Our experimental results suggest an increase of up to 21% in clock frequency and a reduction of up to 39% in combinational hardware area (nearly 10% of the total area) compared to our straightforward implementations. Concerning power, we reduce dynamic power by up to 31% and leakage by nearly 24%.
Abstract. This paper studies the implementability of performance-efficient multi-lane Polymorphic Register Files (PRFs). Our PRF implementation uses a 2D array of p × q linearly addressable memory banks, with customized addressing functions to avoid address routing circuits. We target one single-view and a set of four non-redundant multi-view parallel memory schemes that cover all widely used access patterns in scientific and multimedia applications: 1) p × q rectangle, p · q row, p · q main and secondary diagonals; 2) p × q rectangle, p · q column, p · q main and secondary diagonals; 3) p · q row, p · q column, aligned p × q rectangle; 4) p × q, q × p rectangles (transposition). Reconfigurable hardware was chosen for the implementation due to its potential for enhancing the PRF runtime adaptability. As a proof of concept, we prototyped a PRF with 2 read ports and 1 write port on a Virtex-7 XC7VX1140T-2 FPGA. We consider four sizes for the 16-lane PRFs (16 × 16, 32 × 32, 64 × 64 and 128 × 128) and three multi-lane configurations (8, 16 and 32 lanes) for the 128 × 128 PRF. Synthesis results suggest clock frequencies between 111 MHz and 326 MHz while utilizing less than 10% of the available LUTs. By using customized addressing functions, the LUT usage is reduced by up to 29% and the clock frequency is up to 77% higher compared to a straightforward implementation.
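The notion of a parallel memory scheme over p × q banks can be illustrated with a small sketch. The mapping below is the textbook modulo assignment, used here only as an assumption for illustration; it is not claimed to be any of the schemes or customized addressing functions evaluated in the paper, and the helper names (`bank_of`, `is_conflict_free`, `rectangle`, `row`) are hypothetical. The check shows that this mapping serves any p × q rectangle without bank conflicts, while a p · q element row access does conflict, which hints at why supporting several access patterns at once calls for more elaborate mappings.

```python
def bank_of(i, j, p, q):
    """Bank coordinates of matrix element (i, j) under a plain modulo mapping
    (an illustrative assumption, not the paper's addressing function)."""
    return (i % p, j % q)


def is_conflict_free(cells, p, q):
    """True if every element in `cells` falls into a different bank,
    so that all of them could be accessed in the same cycle."""
    banks = [bank_of(i, j, p, q) for i, j in cells]
    return len(set(banks)) == len(banks)


def rectangle(i0, j0, p, q):
    """Coordinates of a p x q rectangle whose top-left corner is (i0, j0)."""
    return [(i0 + di, j0 + dj) for di in range(p) for dj in range(q)]


def row(i0, j0, p, q):
    """Coordinates of p*q consecutive elements of row i0, starting at column j0."""
    return [(i0, j0 + k) for k in range(p * q)]


if __name__ == "__main__":
    p, q = 2, 4  # 8 banks, hence 8 elements per access
    rect_ok = all(is_conflict_free(rectangle(i, j, p, q), p, q)
                  for i in range(16) for j in range(16))
    print("every p x q rectangle conflict-free:", rect_ok)
    print("p*q row access conflict-free:", is_conflict_free(row(0, 0, p, q), p, q))
```

With p = 2 and q = 4, a row access touches only one bank row, so its 8 elements compete for 4 banks.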
Abstract. We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128-worker configuration requires the caches to deliver 1638.4 GB/s in order to preserve 80% of its peak speedup.
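As a rough reading of the last figure, the aggregate cache bandwidth can be divided by the worker count. This assumes the bandwidth is shared evenly among the 128 workers, which the abstract does not state; the function name `per_worker_bandwidth` is hypothetical.

```python
def per_worker_bandwidth(total_gb_per_s, workers):
    """Average bandwidth per worker, assuming an even split (an assumption)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return total_gb_per_s / workers


if __name__ == "__main__":
    # 1638.4 GB/s across 128 workers, the figures quoted in the abstract
    share = per_worker_bandwidth(1638.4, 128)
    print(f"{share:.1f} GB/s per worker")
```

Under that assumption, each worker would need about 12.8 GB/s from the caches.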