Programming Heterogeneous CPU-GPU Systems by High-Level Dataflow Synthesis

2022

JLPEA

Self Cite

The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does not require rewriting the application software for each configuration setting, but only requires generating a new synthesis of the execution code corresponding to different parameters. The synthesis stage of dataflow programs is usually supported by automatic code generation tools. Another competitive advantage of dataflow software methodologies is that they are well-suited to support designs on heterogeneous parallel systems as they are inherently free of memory access contention issues and naturally expose the available intrinsic parallelism. So as to fully exploit these advantages and to be able to efficiently search the configuration space to find the design points that better satisfy the desired design constraints, it is necessary to develop tools and associated methodologies capable of evaluating the performance of different configurations and to drive the search for good design configurations, according to the desired performance criteria. The number of possible design assumptions and associated parameter settings is usually so large (i.e., the dimensions and size of the design space) that intuition as well as trial and error are clearly unfeasible, inefficient approaches. This paper describes a method for the clock-accurate profiling of software applications developed using the dataflow programming paradigm such as the formal RVC-CAL language. The profiling can be applied when the application program has been compiled and executed on GPU/CPU heterogeneous hardware platforms utilizing two main methodologies, denoted as static and dynamic. This paper also describes how a method for the qualitative evaluation of the performance of such programs as a function of the supplied configuration parameters can be successfully applied to heterogeneous platforms. The technique was illustrated using two different application software examples and several design points.
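
The abstract's claim that a dataflow program can be re-run under different configurations without rewriting the application can be illustrated with a small sketch. The Python below is not the authors' tool flow and does not use RVC-CAL, ORCC, or CUDA. It is a toy three-actor network in which the actor code is fixed, and the partition labels ("cpu", "gpu"), the configuration format, and all names are assumptions made for illustration. No GPU is involved; the labels only change the order in which actors are visited.

```python
from collections import deque


class Fifo:
    """Bounded FIFO channel connecting two actors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()

    def has_token(self):
        return len(self.tokens) > 0

    def has_room(self):
        return len(self.tokens) < self.capacity


def run_network(source_tokens, config):
    """Run the fixed network source -> scale -> sink under one configuration.

    config (hypothetical format): {"buffer_size": int,
                                   "mapping": {actor_name: partition_label}}
    Returns the sink output and the number of scheduling rounds that made progress.
    """
    a = Fifo(config["buffer_size"])
    b = Fifo(config["buffer_size"])
    pending = deque(source_tokens)
    produced = []

    # Each actor fires only when its input has a token and its output has room.
    def source():
        if pending and a.has_room():
            a.tokens.append(pending.popleft())
            return True
        return False

    def scale():
        if a.has_token() and b.has_room():
            b.tokens.append(2 * a.tokens.popleft())
            return True
        return False

    def sink():
        if b.has_token():
            produced.append(b.tokens.popleft())
            return True
        return False

    actors = {"source": source, "scale": scale, "sink": sink}

    # Group actors by partition label; partitions are visited in sorted order.
    partitions = {}
    for name, label in config["mapping"].items():
        partitions.setdefault(label, []).append(actors[name])

    rounds = 0
    while True:
        progress = False
        for label in sorted(partitions):
            for fire in partitions[label]:
                while fire():  # keep firing until the actor is blocked
                    progress = True
        if not progress:
            break
        rounds += 1
    return produced, rounds


# Two illustrative configurations for the same, unchanged actor code.
cfg_small = {"buffer_size": 1,
             "mapping": {"source": "cpu", "scale": "cpu", "sink": "cpu"}}
cfg_split = {"buffer_size": 8,
             "mapping": {"source": "cpu", "scale": "gpu", "sink": "cpu"}}
```

Calling `run_network(range(5), cfg_small)` and `run_network(range(5), cfg_split)` yields the same output tokens, while the number of scheduling rounds differs. That is the property the abstract relies on: the configuration changes how the program executes, not what it computes.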

Section: Compile 4 Execute (mentioning)

Section: Introduction (mentioning)

Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution

2022

JLPEA

Self Cite

“…The tool flow is presented in Figure 2. The high-level representation of the application program written in RVC-CAL, together with configuration files providing partitioning and buffer sizes information, are fed to the ORCC compiler, which uses the Exelixi CUDA backend [48], [49] to generate the C++/CUDA code that is then compiled with the Nvidia CUDA Compiler (NVCC) to obtain an executable of the heterogeneous program. Using a platform-specific compiler as the last layer of the tool-chain allows the methodology to be compatible with all Nvidia supported platforms (i.e., X86(_64), ARM, POWER9, and all Nvidia GPUs).…”

Section: B Partition and Mapping (mentioning)

“…Regarding performance, Table 1 summarizes two different sets of results. The first one is when the idct2d actor runs on the GPU sequentially (this corresponds to the methodology presented in [48]), all other actors are running on the CPU. The second one corresponds to the improved methodology where the idct2d actor runs in parallel on the GPU.…”

Section: RVC-CAL JPEG Decoder (mentioning)

Methodologies for Synthesizing and Analyzing Dynamic Dataflow Programs in Heterogeneous Systems for Edge Computing

IEEE Open J. Circuits Syst.

2021

Self Cite

The possibility of using the increasing computing power available in cloud infrastructures requires the development of new approaches for application software development and optimization. Emerging edge computing paradigms offer the possibility of reducing bandwidth needs and of optimizing latency, features particularly relevant for Big Data applications, by bringing computation closer to the user and to the data generation processes. However, edge computing approaches pose several challenges in terms of how to efficiently take advantage of a distributed network of heterogeneous processing nodes. This paper deals with this problem by extending a dynamic dataflow software development framework and related design flow tools to support heterogeneous platforms. The paper describes the methodology steps for the synthesis of application software executing on heterogeneous CPU/GPU co-processing nodes. The steps include the optimization of the communication between heterogeneous processing elements, a technique for the efficient mapping and parallelization of computation on independent GPU partitions, and the introduction of a dynamic programming approach for leveraging the SIMD nature of GPU computing. To complete the methodology of seamless porting of dataflow software and partition on CPU or GPU computing nodes, an automated methodology for exploring the configuration space and identifying high-performance working points is developed.
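
The idea of exploring the configuration space to identify high-performance working points can be sketched as follows. This is a minimal illustration, not the methodology of the paper: the cost model (per-unit load plus a fixed penalty per CPU/GPU boundary crossing), the actor names `parser` and `display`, and every number are assumptions invented for the example. Exhaustive enumeration is used only because the example is tiny; real design spaces are far too large for it.

```python
from itertools import product


def estimate_cost(mapping, exec_cost, edges, transfer_cost):
    """Hypothetical cost model: the busiest unit bounds execution time,
    and every edge that crosses units adds a fixed transfer penalty."""
    load = {}
    for actor, unit in mapping.items():
        load[unit] = load.get(unit, 0) + exec_cost[actor][unit]
    crossings = sum(1 for src, dst in edges if mapping[src] != mapping[dst])
    return max(load.values()) + transfer_cost * crossings


def explore(exec_cost, edges, transfer_cost, units=("cpu", "gpu")):
    """Enumerate every actor-to-unit mapping and return (cost, mapping) of the cheapest."""
    actors = sorted(exec_cost)
    best = None
    for choice in product(units, repeat=len(actors)):
        mapping = dict(zip(actors, choice))
        cost = estimate_cost(mapping, exec_cost, edges, transfer_cost)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


# Made-up per-actor costs (arbitrary units) for a three-actor pipeline.
exec_cost = {
    "parser": {"cpu": 4, "gpu": 20},
    "idct2d": {"cpu": 30, "gpu": 5},
    "display": {"cpu": 3, "gpu": 15},
}
edges = [("parser", "idct2d"), ("idct2d", "display")]

cheap_link = explore(exec_cost, edges, transfer_cost=2)
costly_link = explore(exec_cost, edges, transfer_cost=20)
```

With a cheap transfer penalty the search places `idct2d` on the GPU and the other actors on the CPU. With an expensive penalty it keeps everything on the CPU, which shows why communication between heterogeneous processing elements has to be part of the evaluation.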

“…The second is the extension of both the design space exploration model defined by the authors of this work and the extension of the open-source toolbox capable of synthesizing low-level code for heterogeneous CPU and GPU platforms. To this end, the methodology already defined in [8] was significantly extended allowing automatically synthesizing a C++/CUDA parallel version for every actors' actions, all taking full advantage of SIMD parallelization techniques. All the innovative contributions of this article can be summarized as follows:…”

Section: Introduction (mentioning)

Dynamic SIMD Parallel Execution on GPU from High-Level Dataflow Synthesis

2022

JLPEA

Self Cite

Developing and fine-tuning software programs for heterogeneous hardware such as CPU/GPU processing platforms is a highly complex endeavor that demands considerable time and effort of software engineers and requires evaluating various fundamental components and features of both the design and the platform to maximize the overall performance. The dataflow programming approach has proven to be an appropriate methodology for reaching such a difficult and complex goal because of its intrinsic portability and the possibility of easily decomposing a network of actors on different processing units of the heterogeneous hardware. Nonetheless, such a design method might not be enough on its own to achieve the desired performance goals, and supporting tools are useful to be able to efficiently explore the design space so as to optimize the desired performance objectives. This article presents a methodology composed of several stages for enhancing the performance of dataflow software developed in RVC-CAL and generating low-level implementations to be executed on GPU/CPU heterogeneous hardware platforms. The stages are composed of a method for the efficient scheduling of parallel CUDA partitions, an optimization of the performance of the data transmission tasks across computing kernels, and the exploitation of dynamic programming for introducing SIMD-capable graphics processing unit systems. The methodology is validated on both the quantitative and qualitative side by means of dataflow software application examples running on platforms according to different mapping configurations.