Scalability Evaluation of a Polymorphic Register File: A CG Case Study

Kuzmanov

2012 15th Euromicro Conference on Digital System Design

2012

Abstract: We study the scalability of multi-lane 2D Polymorphic Register Files (PRFs) in terms of clock cycle time, chip area and power consumption. We assume an implementation which stores data in a 2D array of linearly addressable memory banks, and consider one single-view and four suitable multi-view parallel access schemes which cover all basic access patterns commonly used in scientific and multimedia applications. The PRF design features 2 read ports and 1 write port, targeting the TSMC 90nm ASIC technology. We consider three PRF sizes (32KB, 128KB and 512KB) and four multi-lane configurations (8, 16, 32 and 64 lanes). Synthesis results suggest that the clock frequency varies between 500MHz for a 512KB PRF with 64 vector lanes and 970MHz for the 32KB, 8-lane case. Estimated power consumption ranges from less than 300mW (dynamic) and 10mW (leakage) for our 8-lane, 32KB PRF up to 8.7W (dynamic) and 276mW (leakage) for a 512KB PRF with 64 lanes. We also show the correlation among the storage capacity, the number of lanes, and the overall chip area. Furthermore, we investigate customized addressing functions. Our experimental results suggest up to a 21% increase in clock frequency and up to a 39% reduction in combinational hardware area (nearly 10% of the total area) compared to our straightforward implementations. Concerning power, we reduce dynamic power by up to 31% and leakage by nearly 24%.

Section: Introduction (mentioning)

confidence: 97%

Section: Introduction (mentioning)

confidence: 99%

Scalability Study of Polymorphic Register Files

Kuzmanov

2012 15th Euromicro Conference on Digital System Design

2012

“…A CG case study evaluated the PRF based system scalability in a heterogeneous multi-core architecture and showed CG acceleration by two orders of magnitude using up to 256 PRF cores, with 32 vector lanes each. Moreover, a similar performance level could be achieved by fewer PRF cores compared to a Cell BE-based system, potentially saving area and power [6].…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…Previous studies ([5], [16]) have shown that such PRFs are suitable for computationally intensive workloads such as Floyd, the Conjugate Gradient (CG) Method and dense matrix multiplication. It was also suggested that PRFs can improve the performance efficiency in state-of-the-art many-core computers, potentially saving area and power [6]. More specifically, the potential benefits from using a 2D PRF are: i) improved storage efficiency, as the number of registers, their dimensions and sizes are customized to the workload requirements, and ii) performance gain, as the number of committed instructions is greatly reduced.…”

Section: Introduction (mentioning)

confidence: 99%

On implementability of Polymorphic Register Files

Kuzmanov

7th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC)

2012

Abstract: This paper studies the implementability of performance-efficient multi-lane Polymorphic Register Files (PRFs). Our PRF implementation uses a 2D array of p × q linearly addressable memory banks, with customized addressing functions to avoid address routing circuits. We target one single-view and a set of four non-redundant multi-view parallel memory schemes that cover all widely used access patterns in scientific and multimedia applications: 1) p × q rectangle, p · q row, p · q main and secondary diagonals; 2) p × q rectangle, p · q column, p · q main and secondary diagonals; 3) p · q row, p · q column, aligned p × q rectangle; 4) p × q, q × p rectangles (transposition). Reconfigurable hardware was chosen for the implementation due to its potential in enhancing the PRF runtime adaptability. As a proof of concept, we prototyped a PRF with 2 read ports and 1 write port on a Virtex-7 XC7VX1140T-2 FPGA. We consider four sizes for the 16-lane PRFs (16 × 16, 32 × 32, 64 × 64 and 128 × 128) and three multi-lane configurations (8, 16 and 32 lanes) for the 128 × 128 PRF. Synthesis results suggest clock frequencies between 111 MHz and 326 MHz while utilizing less than 10% of the available LUTs. By using customized addressing functions, the LUT usage is reduced by up to 29% and the clock frequency is up to 77% higher compared to a straightforward implementation.
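The idea of a multi-view parallel memory scheme can be illustrated with a small sketch. The abstract does not give the bank-assignment functions, so the two functions below (`plain_bank` and `skewed_bank`) are assumptions chosen for illustration: the first maps element (i, j) to bank (i mod p, j mod q), the second adds a row skew of the kind often used in parallel memory designs. The helper `is_conflict_free` and the pattern generators are likewise hypothetical names. The sketch only checks whether every element of an access pattern falls into a distinct bank, which is the condition for reading the whole pattern in one cycle.

```python
# Sketch of conflict-free parallel access over a p x q array of memory banks.
# The bank-assignment functions below are illustrative assumptions, not the
# functions used in the paper.

def plain_bank(i, j, p, q):
    """Single-view style mapping: element (i, j) -> bank (i mod p, j mod q)."""
    return (i % p, j % q)

def skewed_bank(i, j, p, q):
    """Assumed row-skewed mapping: the bank row also depends on j // q."""
    return ((i + j // q) % p, j % q)

def rectangle(i0, j0, p, q):
    """Coordinates of a p x q rectangle whose top-left corner is (i0, j0)."""
    return [(i0 + a, j0 + b) for a in range(p) for b in range(q)]

def row(i0, j0, p, q):
    """Coordinates of p*q consecutive elements of row i0, starting at column j0."""
    return [(i0, j0 + k) for k in range(p * q)]

def is_conflict_free(cells, bank_fn, p, q):
    """True when every accessed element is stored in a different bank."""
    banks = [bank_fn(i, j, p, q) for i, j in cells]
    return len(set(banks)) == len(banks)

P, Q = 2, 4  # 2 x 4 banks, i.e. 8 lanes

rect_ok_plain = is_conflict_free(rectangle(3, 5, P, Q), plain_bank, P, Q)
row_ok_plain = is_conflict_free(row(3, 5, P, Q), plain_bank, P, Q)
rect_ok_skewed = is_conflict_free(rectangle(3, 5, P, Q), skewed_bank, P, Q)
row_ok_skewed = is_conflict_free(row(3, 5, P, Q), skewed_bank, P, Q)

print(rect_ok_plain, row_ok_plain, rect_ok_skewed, row_ok_skewed)
```

Under these assumptions the plain mapping serves unaligned rectangles but not rows, while the skewed mapping serves both, which is the sense in which one scheme offers multiple "views" of the same storage. Diagonal and column patterns are not checked here.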

“…Compared to the Cell CPU, PRFs decrease the number of instructions for a customized, high performance dense matrix multiplication by up to 35X [7] and improve performance for Floyd and sparse matrix vector multiplication [8]. A Conjugate Gradient case study evaluated the scalability of up to 256 PRF-based accelerators in a heterogeneous multi-core architecture, with two orders of magnitude performance improvements [11]. Furthermore, potential power and area savings were shown by employing fewer PRF cores compared to a system with Cell processors.…”

Section: Introduction (mentioning)

confidence: 99%

The Case for Polymorphic Registers in Dataflow Computing

Pilato

et al. 2017

Int J Parallel Prog

Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data-parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve throughput by up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9 × 9 or larger mask sizes, even in bandwidth-constrained systems.
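For readers unfamiliar with the case study, the following minimal sketch shows what "separable" means for a 2D convolution: when the mask is the outer product of a column vector and a row vector, one 2D pass can be replaced by a horizontal 1D pass followed by a vertical 1D pass. This is a plain Python illustration, not the article's kernel or its PRF mapping; all function names and the sample data are assumptions, and the sliding-window sum is written without flipping the mask.

```python
# Separable 2D convolution sketch (sliding-window form, no mask flip).
# Names and data are illustrative assumptions, not taken from the article.

def outer(col, row):
    """Build the 2D mask as the outer product of a column and a row vector."""
    return [[c * r for r in row] for c in col]

def conv2d_valid(image, mask):
    """Direct 2D pass over positions where the mask fits entirely."""
    kh, kw = len(mask), len(mask[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + a][x + b] * mask[a][b]
                 for a in range(kh) for b in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def separable_valid(image, col, row):
    """Horizontal 1D pass with `row`, then vertical 1D pass with `col`."""
    kh, kw = len(col), len(row)
    tmp = [[sum(line[x + b] * row[b] for b in range(kw))
            for x in range(len(line) - kw + 1)]
           for line in image]
    return [[sum(tmp[y + a][x] * col[a] for a in range(kh))
             for x in range(len(tmp[0]))]
            for y in range(len(tmp) - kh + 1)]

def multiplies_per_output(k):
    """Multiplications per output element: (direct 2D, separable) for a k x k mask."""
    return k * k, 2 * k

image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]
col, row = [1, 2, 1], [1, 0, -1]

direct = conv2d_valid(image, outer(col, row))
two_pass = separable_valid(image, col, row)
print(direct == two_pass, multiplies_per_output(9))
```

The two results agree exactly on integer data, and the multiplication count per output element drops from k² to 2k, which is 81 versus 18 for a 9 × 9 mask.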