The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does not require rewriting the application software for each configuration setting, but only requires generating a new synthesis of the execution code corresponding to different parameters. The synthesis stage of dataflow programs is usually supported by automatic code generation tools. Another competitive advantage of dataflow software methodologies is that they are well-suited to support designs on heterogeneous parallel systems as they are inherently free of memory access contention issues and naturally expose the available intrinsic parallelism. So as to fully exploit these advantages and to be able to efficiently search the configuration space to find the design points that better satisfy the desired design constraints, it is necessary to develop tools and associated methodologies capable of evaluating the performance of different configurations and to drive the search for good design configurations, according to the desired performance criteria. The number of possible design assumptions and associated parameter settings is usually so large (i.e., the dimensions and size of the design space) that intuition as well as trial and error are clearly unfeasible, inefficient approaches. This paper describes a method for the clock-accurate profiling of software applications developed using the dataflow programming paradigm such as the formal RVL-CAL language. The profiling can be applied when the application program has been compiled and executed on GPU/CPU heterogeneous hardware platforms utilizing two main methodologies, denoted as static and dynamic. This paper also describes how a method for the qualitative evaluation of the performance of such programs as a function of the supplied configuration parameters can be successfully applied to heterogeneous platforms. The technique was illustrated using two different application software examples and several design points.
Developing and fine-tuning software programs for heterogeneous hardware such as CPU/GPU processing platforms comprise a highly complex endeavor that demands considerable time and effort of software engineers and requires evaluating various fundamental components and features of both the design and of the platform to maximize the overall performance. The dataflow programming approach has proven to be an appropriate methodology for reaching such a difficult and complex goal for the intrinsic portability and the possibility of easily decomposing a network of actors on different processing units of the heterogeneous hardware. Nonetheless, such a design method might not be enough on its own to achieve the desired performance goals, and supporting tools are useful to be able to efficiently explore the design space so as to optimize the desired performance objectives. This article presents a methodology composed of several stages for enhancing the performance of dataflow software developed in RVC-CAL and generating low-level implementations to be executed on GPU/CPU heterogeneous hardware platforms. The stages are composed of a method for the efficient scheduling of parallel CUDA partitions, an optimization of the performance of the data transmission tasks across computing kernels, and the exploitation of dynamic programming for introducing SIMD-capable graphics processing unit systems. The methodology is validated on both the quantitative and qualitative side by means of dataflow software application examples running on platforms according to various different mapping configurations.
This paper presents new optimization approaches aiming at reducing the impact of memory accesses on the performance of dataflow programs. The approach is based on introducing a high level management of composite data types in dynamic dataflow programming language for the memory processing of data tokens. It does not require essential changes to the model of computation (MOC) or to the dataflow program itself. The objective of the approach is to remove the unnecessary constraints of memory isolations without introducing limitations to the scalability and composability properties of the dataflow paradigm. Thus the identified optimizations allow to keep the same design and programming philosophy of dataflow, whereas aiming at improving the performance of the specific configuration implementation. The different optimizations can be integrated into the current RVC-CAL design flows and synthesis tools and can be applied to different sub-networks partitions of the dataflow program. The paper introduces the context, the definition of the optimization problem and describes how it can be applied to dataflow designs. Some examples of the optimizations are provided.
The possibility of using the increasing computing power available in cloud infrastructures requires the development of new approaches for application software development and optimization. Emerging edge computing paradigms offer the possibility of reducing bandwidth needs and of optimizing latency, features particularly relevant for Big Data applications, by bringing computation closer to the user and to the data generation processes. However, edge computing approaches pose several challenges in terms of how to be able to efficiently take advantage of a distributed network of heterogeneous processing nodes. This paper deals with this problem by extending a dynamic dataflow software development framework and related design flow tools to support heterogeneous platforms. The paper describes the methodology steps for the synthesis of application software executing on heterogeneous CPU/GPU co-processing nodes. The steps do include the optimization of the communication between heterogeneous processing elements, a technique for the efficient mapping and parallelization of computation on independent GPU partitions, and the introduction of dynamic programming approach for leveraging the SIMD nature of GPU computing. To complete the methodology of seamless porting of dataflow software and partition on CPU or GPU computing nodes, an automated methodology for exploring the configuration space and to identify high performance working points is developed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.