Spiridon F. Beldianu scite author profile

ACM Trans. Embed. Comput. Syst.

2013

For most of the applications that make use of a dedicated vector coprocessor, its resources are not highly utilized due to the lack of sustained data parallelism which often occurs due to vector-length variations in dynamic environments. The motivation of our work stems from: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the need to often handle a variety of vector sizes; and (d) vector kernels in application suites may have diverse computation needs. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic shared working policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these vector coprocessor sharing policies for a dual-core system and evaluate them using the floating-point performance, resource utilization, and power/energy consumption metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, and LU factorization shows that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provide speedups of 1.2 to 2 and reduce the energy needs by about 50% as compared to a system having a single core with an attached vector coprocessor. With the performance expressed in clock cycles, the sharing policies demonstrate speedups of 3.62 to 7.92 compared to optimized Xeon runs. We also introduce performance and empirical power models that can be used by the runtime system to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.

FPGA and ASIC square root designs for high performance and power efficiency

Suresh

2013

Floating-point square root is a fundamental operation in signal processing and various HPC applications. Since this operation is expensive in resource and energy consumption, its efficient implementation should be a priority in future multicores that will face dark silicon issues. This paper presents a low-cost, low-power design to calculate the square root using the IEEE 754 single-precision floating-point format. Two versions of the design are investigated, with and without clock gating (CG). Evaluation involves FPGA and ASIC technologies at 40 and 65 nm. Substantial performance gains and reduced power consumption are obtained as compared to a popular iterative solution. The ASIC design demonstrates much lower power consumption, which at 40 nm is lower than that at 65 nm by about a factor of three. At 40 nm, CG for the ASIC realization is justified primarily for low activity rates.

On-chip Vector Coprocessor Sharing for Multicores

2011

For most of the applications that make use of a vector coprocessor, the resources are not highly utilized due to the lack of sustained data parallelism, which sometimes occurs due to vector-length changes in dynamic environments. The motivation of our work stems from (a) the mandate for multicore designs to make efficient use of the on-chip resources; (b) the frequent presence of vector operations in high-performance scientific and embedded applications; (c) the increased probability that different cores may deal with different vector lengths at various times; and (d) different vector kernels in the same or different application suites may have diverse computation needs. Our objective is to provide a versatile design framework that can facilitate vector coprocessor sharing among multiple cores in a manner that maximizes resource utilization while also yielding very high performance at reduced cost. We propose three basic shared vector coprocessor architectures for multicores based on coarse-grain, fine-grain, and vector-lane sharing. We benchmark these distinct vector architectures for a dual-core system using the floating-point performance and resource utilization metrics. Our analysis shows that vector-lane sharing, where the number of vector lanes assigned to a core can be controlled dynamically, provides the greatest flexibility and generally yields very good results. Since, however, each of the three design choices has its own performance advantages under certain vector-load conditions, we ultimately suggest a hybrid vector coprocessor design that can support all three architectural choices according to the collective needs of the cores and applications.

Re-Configurable Parallel Match Evaluators Applied to Scheduling Schemes for Input-Queued Packet Switches

Rojas‐Cessa

Oki

et al. 2009

The performance of matching schemes for input-queued (IQ) packet switches is mainly defined by the selection policy adopted. This policy can aim to produce a large weight sum for matched input-output pairs, where each input-output pair is assigned a weight, or to produce a large match size in the number of matched pairs, giving rise to maximum weight matching or maximum size matching, respectively. However, schedulers can only provide a single match as a function of the adopted selection policy (for candidate ports) and of the backlogged traffic at the input queues. A parallel match evaluator was recently proposed to provide not one but several match options at the same time. This approach evaluates several predefined and fixed matches and picks the match with the largest size. However, the fixed permutations of the evaluated matches may produce low performance under traffic with nonuniform distributions because of the limited number of choices. This paper proposes making the parallel match evaluator configurable, along with two schemes that provide diverse and changeable matches so that the matches (and therefore the evaluator) become adaptable to the traffic pattern. The proposed schemes were tested under uniform and nonuniform traffic patterns, and the results show that these schemes provide high performance, even when scheduling is performed between periods of multiple time slots, or framed intervals. The proposed approach can be used for configuring slow micro-electro-mechanical (MEM) optical switch fabrics.

Index Terms: packet switches, maximal weight matching, parallel matching, input-queued switch, iterative matching.
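The fixed parallel match evaluator described in the abstract above (evaluate several predefined matches, keep the one with the largest size) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `backlog[i][j]` occupancy matrix (cells queued at input i for output j), and the use of cyclic-shift permutations as the candidate matches are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Match = Tuple[int, ...]  # match[i] = output port paired with input i


def match_size(match: Match, backlog: Sequence[Sequence[int]]) -> int:
    """Count the pairs (i, match[i]) whose queue actually holds traffic."""
    return sum(1 for i, j in enumerate(match) if backlog[i][j] > 0)


def evaluate_matches(
    candidates: List[Match], backlog: Sequence[Sequence[int]]
) -> Tuple[Match, int]:
    """Score every candidate match and return the largest one with its size.

    The scoring is written sequentially here; the evaluator in the abstract
    is described as evaluating the candidates in parallel.
    """
    best = max(candidates, key=lambda m: match_size(m, backlog))
    return best, match_size(best, backlog)


# Hypothetical 3x3 switch: candidates are the three cyclic shifts (assumption).
n = 3
candidates = [tuple((i + s) % n for i in range(n)) for s in range(n)]
backlog = [
    [0, 2, 0],  # input 0 has cells only for output 1
    [0, 0, 1],  # input 1 has cells only for output 2
    [3, 0, 0],  # input 2 has cells only for output 0
]
best, size = evaluate_matches(candidates, backlog)
print(best, size)
```

With fixed candidates like these, traffic concentrated on pairs that no candidate covers yields small matches, which is the limitation under nonuniform traffic that motivates making the candidate set configurable.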

Performance-Energy Optimizations for Shared Vector Accelerators in Multicores

2015

IEEE Trans. Comput.