Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters

Conti, Francesco; Marongiu, Andrea; Benini, Luca

doi:10.1109/codes-isss.2013.6658992

Cited by 12 publications

(10 citation statements)

References 21 publications


“…While this strategy can result in orders-of-magnitude power reductions, it is also inflexible, as each block can perform a single function. These characteristics are even more pre-eminent when accelerators are shared by multiple cores [36], [37], because requests for accelerated functions must be arbitrated.…”

Section: State-of-the-art (mentioning)

confidence: 99%

HEAL-WEAR: An Ultra-Low Power Heterogeneous System for Bio-Signal Analysis

Duch, Basu, Braojos et al. 2017

IEEE Trans. Circuits Syst. I


Abstract: Personalized healthcare devices enable low-cost, unobtrusive and long-term acquisition of clinically-relevant biosignals. These appliances, termed Wireless Body Sensor Nodes (WBSNs), are fostering a revolution in health monitoring for patients affected by chronic ailments. Nowadays, WBSNs often embed complex digital processing routines, which must be performed within an extremely tight energy budget. Addressing this challenge, in this paper we introduce a novel computing architecture devoted to the ultra-low power analysis of biosignals. Its heterogeneous structure comprises multiple processors interfaced with a shared acceleration resource, implemented as a Coarse-Grained Reconfigurable Array (CGRA). The CGRA mesh effectively supports the execution of the intensive loops that characterize bio-signal analysis applications, while requiring a low reconfiguration overhead. Moreover, both the processors and the reconfigurable fabric feature Single-Instruction / Multiple-Data (SIMD) execution modes, which increase efficiency when multiple data streams are concurrently processed. The run-time behavior on the system is orchestrated by a light-weight hardware mechanism, which concurrently synchronizes processors for SIMD execution and regulates access to the reconfigurable accelerator. By jointly leveraging run-time reconfiguration and SIMD execution, the illustrated heterogeneous system achieves, when executing complex bio-signal analysis applications, speedups of up to 11.3x on the considered kernels and up to 37.2% overall energy savings, with respect to an ultra-low power multicore platform which does not feature CGRA acceleration.


“…Cong et al [10] also tackle the utilization wall by developing a heterogeneous multi-core architecture with shared-memory accelerators; their HW IPs communicate by means of shared L2 caches, accessible through NoC nodes. Previous work by our group (Burgio et al [8], Dehyadegari et al [16,17], Conti et al [11]) considers a tightly-coupled multi-core based on RISC32 cores sharing a L1 scratchpad and extend it with hardware processing units (HWPUs). HWPUs are managed by the software through an OpenMP-based programming model designed to mix parallelization and acceleration.…”

Section: Related Work (mentioning)

confidence: 99%

“…Figure 1 shows a diagram of the He-P2012 cluster extended with HWPEs for heterogeneous computing. As detailed in our previous work [11], HWPEs are designed as two separate modules:…”

Section: He-P2012: Heterogeneous P2012 (mentioning)

confidence: 99%

He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores

Conti, Marongiu, Pilkington et al. 2015

J Sign Process Syst

Self Cite


The end of Dennardian scaling in advanced technologies brought about new architectural templates to overcome the so-called utilization wall and provide Moore's Law-like performance and energy scaling in embedded SoCs. One of the most promising templates, architectural heterogeneity, is hindered by high cost due to the design space explosion and the lack of effective exploration tools. Our work provides three contributions towards a scalable and effective methodology for design space exploration in embedded MC-SoCs. First, we present the He-P2012 architecture, augmenting the state-of-art STMicroelectronics P2012 platform with heterogeneous shared-L1 coprocessors called HW processing elements (HWPE). Second, we propose a novel methodology for the semi-automatic definition and instantiation of shared-memory HWPEs from a C source, supporting both simple and structured data types. Third, we demonstrate that the integration of HWPEs can provide significant performance and energy efficiency benefits on a set of benchmarks originally developed for the homogeneous P2012, achieving up to 123x speedup on the accelerated code region (∼98 % of Amdahl's law limit) while saving 2/3 of the energy.
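One way to read the "∼98 % of Amdahl's law limit" figure is sketched below. This is an interpretation, not something the abstract states: it assumes the limit is the bound Amdahl's law gives when the accelerated fraction of the code region runs infinitely fast, and the symbols p (accelerated fraction), s (accelerator speedup) and S (overall speedup of the region) are introduced here for illustration only.

```latex
% Amdahl's law: a fraction p of the work is sped up by a factor s
S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
S_{\max}(p) = \lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}.

% Assumed reading of the abstract: S = 123 is about 98% of S_max
S_{\max} \approx \frac{123}{0.98} \approx 125.5
\quad\Longrightarrow\quad
1 - p \approx \frac{1}{125.5} \approx 0.008 .
```

Under that assumption, roughly 0.8% of the region's original run time would remain outside the accelerator, which is what caps the achievable speedup near 125x regardless of how fast the accelerator itself is.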


“…Dehyadegari [12] and Conti [13] exploit shared-memory as a communication medium between cores and accelerators. Our current and previous work [14], [15] assumes the same architecture, tackling also programmability and scalability issues.…”

Section: Related Work (mentioning)

confidence: 99%

A HLS-Based Toolflow to Design Next-Generation Heterogeneous Many-Core Platforms with Shared Memory

Burgio, Marongiu, Coussy et al. 2014

2014 12th IEEE International Conference on Embedded and Ubiquitous Computing

Self Cite


This work describes how we use High-Level Synthesis to support design space exploration (DSE) of heterogeneous many-core systems. Modern embedded systems increasingly couple hardware accelerators and processing cores on the same chip, to trade specialization of the platform to an application domain for increased performance and energy efficiency. However, the process of designing such a platform is complex and error-prone, and requires skills on algorithmic aspects, hardware synthesis, and software engineering. DSE can partially be automated, and thus simplified, by coupling the use of HLS tools and virtual prototyping platforms. In this paper we enable the design space exploration of heterogeneous many-cores adopting a shared-memory architecture template, where communication and synchronization between the hardware accelerators and the cores happens through L1 shared memory. This communication infrastructure leverages a "zero-copy" scheme, which simplifies both the design process of the platform and the development of applications on top of it. Moreover, the shared-memory template perfectly fits the semantics of several high-level programming models, such as OpenMP. We provide programmers with simple yet powerful abstractions to exploit accelerators from within an OpenMP application, and propose a low-cost implementation of the necessary runtime support. An HLS-based automatic design flow is set up, to quickly explore the design space using a cycle-accurate virtual platform.
