Reduction Operator for Wide-SIMDs Reconsidered

Waeijen, Luc; She, Dongrui; Corporaal, Henk; He, Yifan

doi:10.1145/2593069.2593198

Cited by 2 publications

(4 citation statements)

References 4 publications


“…For example, partial histogram merging, row projection (sub kernels in the FFoS application described in Section 3.2), Find Maximal Element in a Vector, and Sum of Vector Elements (categorized as global-to-point kernels in Section 3.2). To efficiently handle these kernels on a wide SIMD with only a circular neighborhood network, we introduced two novel reduction algorithms, pipelined reduction and diagonal access reduction, which do not rely on complex communication networks or any dedicated hardware [29]. The key idea of both approaches is to utilize inter-vector parallelism instead of intra-vector parallelism.…”

Section: Left Pe (mentioning)

confidence: 99%

“…The experimental results show that, with the proposed algorithms, performance is comparable to that achieved with dedicated reduction hardware. For details, please refer to the work of L. Waeijen et al. [29].…”

Section: Left Pe (mentioning)

confidence: 99%

“…This is because kernels with only local communication can be efficiently mapped onto such a design, while kernels with global access will spend a significant amount of cycles on data transfers between PEs that are far apart. To reduce the overall cost of long-distance communication, we introduced two algorithms, pipelined reduction and diagonal access reduction, which do not rely on complex communication networks or any dedicated hardware [29]. The key idea of both approaches is to utilize inter-vector parallelism instead of intra-vector parallelism, which can be applied to both global-to-point and global-to-global kernels.…”

Section: Benchmarks (mentioning)

confidence: 99%

“…Therefore, when the array becomes larger, there is no communication penalty such as the ones for the max and reduction kernels. As mentioned in the previous sections, to reduce the overall cost of long-distance communication, we introduced two algorithms that exploit inter-vector parallelism instead of intra-vector parallelism [29]. These approaches can also be applied to the global-to-global kernels.…”

Section: SIMD vs RISC (mentioning)

confidence: 99%
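
The citation statements above contrast intra-vector with inter-vector parallelism but do not spell out the two algorithms. The sketch below is not the paper's pipelined or diagonal access reduction; it is a minimal Python illustration of the general distinction, assuming a hypothetical machine model in which each lane can only pass values to its neighbour over a circular network. All function names are invented for illustration, and vectors are assumed non-empty and of equal length. It sums every element of several vectors in two ways: reducing each vector with neighbour shifts, or first adding the vectors lane by lane (which needs no communication) and reducing once at the end.

```python
def rotate_left(vec, k=1):
    """Circular neighbour shift: lane i receives the value of lane (i + k) % n."""
    k %= len(vec)
    return vec[k:] + vec[:k]


def lane_add(a, b):
    """Lane-wise addition of two vectors; no data moves between lanes."""
    return [x + y for x, y in zip(a, b)]


def intra_vector_sum(vec):
    """Sum one vector using only neighbour shifts.

    Returns (sum, number_of_shift_steps). Every lane ends up holding
    the full sum after n - 1 shift-and-add steps.
    """
    acc, shifted, shifts = list(vec), list(vec), 0
    for _ in range(len(vec) - 1):
        shifted = rotate_left(shifted)
        acc = lane_add(acc, shifted)
        shifts += 1
    return acc[0], shifts


def sum_reduce_each(vectors):
    """Intra-vector style: reduce every vector separately, then add the results."""
    total, shifts = 0, 0
    for v in vectors:
        s, c = intra_vector_sum(v)
        total += s
        shifts += c
    return total, shifts


def sum_accumulate_first(vectors):
    """Inter-vector style: combine vectors lane by lane, then reduce only once."""
    acc = list(vectors[0])
    for v in vectors[1:]:
        acc = lane_add(acc, v)
    return intra_vector_sum(acc)


vectors = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(sum_reduce_each(vectors))       # same total, one reduction per vector
print(sum_accumulate_first(vectors))  # same total, a single reduction
```

In this toy model the shift count grows with the number of vectors in the first variant and stays fixed in the second, which is one way to read the statements' point about long-distance communication cost.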


A Low-Energy Wide SIMD Architecture with Explicit Datapath

Waeijen, She, Corporaal et al. 2014

J Sign Process Syst

Self Cite


Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3% on average and up to 94%, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64% of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5%, and maximum of 43.0%, reduction is achieved.


R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA

de Bruin, Vadivel, Wijtvliet et al. 2024

ACM Trans. Reconfigurable Technol. Syst.


Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domains require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks, a ULP CGRA with a HW/SW co-design tool-flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD-processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45× higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.
