Generating Code and Memory Buffers to Reorganize Data on Many-core Architectures

2018 IEEE High Performance Extreme Computing Conference (HPEC)

et al. 2018

OpenVX is a standard proposed by the Khronos group for cross-platform acceleration of computer vision and deep learning applications. OpenVX abstracts the target processor architecture complexity and automates the implementation of processing pipelines through high-level optimizations. While highly efficient OpenVX implementations exist for shared memory multi-core processors, targeting OpenVX to clustered manycore processors appears challenging. Indeed, such processors comprise multiple compute units or clusters, each fitted with an on-chip local memory shared by several cores. This paper describes an efficient implementation of OpenVX that targets clustered manycore processors. We propose a framework that includes computation graph analysis, kernel fusion techniques, RDMA-based tiling into local memories, optimization passes, and a distributed execution runtime. This framework is implemented and evaluated on the 2nd-generation Kalray MPPA® clustered manycore processor. Experimental results show that super-linear speed-ups are obtained for multi-cluster execution by leveraging the bandwidth of on-chip memories and the capabilities of asynchronous RDMA engines.
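The abstract's idea of tiling an image into local memories can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `halo` parameter (extra border pixels a stencil kernel might need), and the double-buffering factor are all assumptions made for illustration.

```python
def make_tiles(width, height, tile_w, tile_h, halo=0):
    """Return (x0, y0, x1, y1) regions covering a width x height image.

    Each region is a tile_w x tile_h tile, grown by `halo` pixels on every
    side and clamped to the image bounds. `halo` is a hypothetical parameter.
    """
    tiles = []
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            x0 = max(x - halo, 0)
            y0 = max(y - halo, 0)
            x1 = min(x + tile_w + halo, width)
            y1 = min(y + tile_h + halo, height)
            tiles.append((x0, y0, x1, y1))
    return tiles


def fits_local_memory(tile, bytes_per_pixel, local_bytes, buffers=2):
    """Check that `buffers` copies of a tile fit in a local memory.

    buffers=2 assumes double buffering, so that one tile can be transferred
    asynchronously while another is being processed (an assumption here).
    """
    x0, y0, x1, y1 = tile
    return (x1 - x0) * (y1 - y0) * bytes_per_pixel * buffers <= local_bytes


if __name__ == "__main__":
    for tile in make_tiles(5, 4, 2, 2):
        print(tile, fits_local_memory(tile, bytes_per_pixel=1, local_bytes=64))
```

A tile size would be chosen so that every region passes the capacity check for the target cluster's local memory.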

Section: Runtime Optimization RDMA-based Tiling and Fusion (mentioning)

confidence: 99%

A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore

Hascoe

Dinechin

2018 IEEE High Performance Extreme Computing Conference (HPEC)

et al. 2018

“…Indeed, by using contiguous memory spaces, the developer of an application avoids the multiple jumps in memory that would have a negative impact on the system performance. By doing so, the developer also avoids writing complex pointer operations that would decrease the source code readability [Cudennec et al 2014].…”

Section: Gray (mentioning)

confidence: 99%

“…In [Cudennec et al 2014], a technique is proposed to enable buffer merging for a set of actors with pre-defined behavior. In contrast to the method presented in this article, this technique does not allow buffer merging for actors with a user-defined behavior.…”

Section: Dataflow Optimizations (mentioning)

confidence: 99%

On Memory Reuse Between Inputs and Outputs of Dataflow Actors

ACM Trans. Embed. Comput. Syst.

Pelcat

Nezan

et al. 2016

This article introduces a new technique to minimize the memory footprints of Digital Signal Processing (DSP) applications specified with Synchronous Dataflow (SDF) graphs and implemented on shared-memory Multiprocessor Systems-on-Chips (MPSoCs). In addition to the SDF specification, which captures data dependencies between coarse-grained tasks called actors, the proposed technique relies on two optional inputs abstracting the internal data dependencies of actors: annotations of the ports of actors, and script-based specifications of merging opportunities between input and output buffers of actors. Experimental results on a set of applications show a reduction of the memory footprint by 48% compared to state-of-the-art minimization techniques.

“…In [5], a technique is proposed to enable buffer merging for a set of actors with pre-defined behavior. Contrary to the method presented in this paper, this technique does not allow buffer merging for actors with a user-defined behavior.…”

Section: Related Work (mentioning)

confidence: 99%

“…SDF actors are considered as "black boxes" within the model whose internal behavior can be implemented in any programming language. To simplify the description of this internal behavior, it is convenient to assume that the memory consumed and produced on each FIFO during the firing of an actor constitutes a contiguous memory space called a buffer [5]. To reveal these buffers, an SDF graph can be transformed into an equivalent single-rate graph where each FIFO is replaced with single-rate FIFOs whose consumption and production rates are equal (Figure 2).…”

Section: Introduction (mentioning)

confidence: 99%
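The single-rate transformation mentioned in the quote above can be sketched for one FIFO as follows. This is an illustrative sketch only, assuming a FIFO with no initial tokens; the function name and the tuple layout are hypothetical and not taken from the cited articles.

```python
from math import gcd


def single_rate_fifos(prod_rate, cons_rate):
    """Split one SDF FIFO into single-rate FIFOs over one graph iteration.

    Assumes no initial tokens (delays) on the FIFO. Returns a list of
    (producer_firing, consumer_firing, tokens) tuples, where `tokens` is
    both the production and the consumption rate of that single-rate FIFO.
    """
    # Tokens exchanged in one iteration: least common multiple of the rates.
    total = prod_rate * cons_rate // gcd(prod_rate, cons_rate)
    fifos = []
    start = 0
    while start < total:
        src = start // prod_rate  # producer firing that writes token `start`
        dst = start // cons_rate  # consumer firing that reads token `start`
        # The run of tokens ends where either firing's buffer ends.
        end = min((src + 1) * prod_rate, (dst + 1) * cons_rate)
        fifos.append((src, dst, end - start))
        start = end
    return fifos


if __name__ == "__main__":
    # Producer writes 2 tokens per firing, consumer reads 3 per firing.
    print(single_rate_fifos(2, 3))
    # [(0, 0, 2), (1, 0, 1), (1, 1, 1), (2, 1, 2)]
```

In this example the producer fires three times and the consumer twice per iteration, and each resulting FIFO carries a contiguous run of tokens between exactly one producer firing and one consumer firing.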

Buffer merging technique for minimizing memory footprints of Synchronous Dataflow specifications

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Pelcat

Nezan

et al. 2015

This paper introduces and assesses a new technique to minimize the memory footprints of Digital Signal Processing (DSP) applications specified with Synchronous Dataflow (SDF) graphs and implemented on shared-memory Multiprocessor Systems-on-Chips (MPSoCs). In addition to the SDF specification, which captures data dependencies between coarse-grained tasks called actors, the proposed technique relies on two optional inputs abstracting the internal data dependencies of actors: annotations of the ports of SDF actors, and script-based specifications of merging opportunities between input and output buffers of actors. An automated optimization process is used to exploit these buffer merging opportunities and to minimize the memory footprints of applications. Experimental results on a computer vision application show a reduction of the memory footprint by 34% compared to state-of-the-art minimization techniques.
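As a toy illustration of why merging input and output buffers reduces the footprint, the sketch below groups buffers that are allowed to share memory and sums one allocation per group. It is not the technique of the paper: it ignores buffer lifetimes, port annotations, and merging scripts, and the names `merged_footprint`, `sizes`, and `merges` are hypothetical.

```python
def merged_footprint(sizes, merges):
    """Total bytes needed when some buffers may share the same memory.

    sizes:  dict mapping buffer name -> size in bytes.
    merges: list of (buffer_a, buffer_b) pairs assumed safe to overlay,
            e.g. an actor's input and output when it works in place.
    Each group of merged buffers is allocated once, at the size of its
    largest member (a simplifying assumption).
    """
    parent = {name: name for name in sizes}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in merges:
        parent[find(a)] = find(b)

    group_size = {}
    for name, size in sizes.items():
        root = find(name)
        group_size[root] = max(group_size.get(root, 0), size)
    return sum(group_size.values())


if __name__ == "__main__":
    sizes = {"filter_in": 100, "filter_out": 100, "display_in": 50}
    print(merged_footprint(sizes, []))                             # 250
    print(merged_footprint(sizes, [("filter_in", "filter_out")]))  # 150
```

The two calls compare separate allocation of every buffer with the case where the filter's output is written over its input.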