Any modern graphics processing unit (GPU) is a good platform for running massively parallel programs. Still, we lack tools to observe and measure the performance characteristics of GPU-based software. We argue that, because of the complex memory hierarchy and the thousands of execution threads, all performance issues come down to efficient use of the graphics card memory hierarchy. We propose using the GPGPUSim simulator, previously used mostly for validating graphics card architectures, to validate the performance of CUDA-based programs. We provide examples that show how to use the simulation for performance analysis of massively parallel programs.
The paper considers the modeling of concurrent programs using resource-constrained automata. Several software examples are considered: real-time operating systems, video processing including object recognition, neural network inference, and common methods for solving linear systems in the modeling of physical processes. Source code annotation and automatic extraction of program resource constraints with the help of profiling software are also considered; together they enable modeling of concurrent software behavior with minimal user assistance.
Modern parallel programs run in a complex, resource-limited environment, and this raises new requirements for the resource consumption and execution stability of long-running processes. To help check resource constraints for such parallel software, a formal model of a resource-limited parallel program was developed. The model expresses resource and time constraints and is suitable for both fine-grained and coarse-grained parallelism in programs. For coarser-grained parallelism (independent procedures, large loop iterations, large computing blocks for graphics, video and neural network processing), the formal model can be interpreted at run time to avoid deadlocks and hangs during resource allocation. We discuss several modern software frameworks that can integrate the functionality needed to interpret the model and check the feasibility of a set of parallel programs running simultaneously on hardware with resource and time limitations. Real-world tasks that benefit from such models (neural network inference, video processing, general-purpose computing on GPUs) are discussed.
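The run-time feasibility check described in this abstract can be illustrated with a minimal sketch. It is not the paper's formal model: the `Task` fields, the resource names and the greedy earliest-deadline policy are all assumptions made for illustration. Each task takes all of its resources at once or none of them, so no task ever waits while holding a partial allocation, which is one simple way to rule out allocation deadlocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description; field names are illustrative only."""
    name: str
    demand: dict      # resource name -> units held for the whole run
    duration: float   # assumed worst-case run time
    deadline: float   # latest allowed finish time


def fits(demand, free):
    """True if every requested resource is currently available."""
    return all(free.get(res, 0) >= units for res, units in demand.items())


def schedule(tasks, capacity):
    """Greedy earliest-deadline schedule with all-or-nothing allocation.

    Returns {task name: start time}, or None if this greedy policy finds
    no schedule. None does not prove that no schedule exists.
    """
    free = dict(capacity)
    pending = sorted(tasks, key=lambda t: t.deadline)
    running = []      # list of (finish time, task)
    now = 0.0
    starts = {}
    while pending:
        for task in list(pending):
            if fits(task.demand, free):
                if now + task.duration > task.deadline:
                    return None          # would finish too late
                for res, units in task.demand.items():
                    free[res] -= units   # take everything at once
                running.append((now + task.duration, task))
                starts[task.name] = now
                pending.remove(task)
        if pending:
            if not running:
                return None              # demand exceeds total capacity
            # Advance time to the earliest finish and release its resources.
            running.sort(key=lambda pair: pair[0])
            now, done = running.pop(0)
            for res, units in done.demand.items():
                free[res] += units
    return starts
```

A caller would pass the hardware limits as `capacity`, for example `{"gpu_mem": 8, "cores": 4}`, and admit the task set only if `schedule` returns a start-time map. An exact check would need a search over schedules rather than a single greedy pass.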
Modern workloads, parallel or sequential, often suffer from insufficient memory and compute performance. Common ways to improve workload performance include the use of complex functional units or coprocessors, which not only accelerate computation but also fetch data from memory independently, generating complex address patterns with or without support for control-flow operations. Optimizing compilers usually do not target such coprocessors, so they have to be used by hand through special application interfaces. On the other hand, memory bottlenecks may be avoided through proper use of processor prefetch capabilities, which load the necessary data ahead of the time it is actually used; prefetching is likewise applied automatically only in simple cases, so programmers usually have to insert it by hand. As workloads rapidly migrate to embedded applications, the problem arises of how to use all hardware capabilities to speed up a workload with moderate effort. This requires precise analysis of memory access patterns at program run time and identification of the hot spots where the bulk of memory accesses is issued. A precise memory access model can be analyzed with simulators, for example Valgrind, which can run very large workloads, such as neural network inference, in reasonable time. But simulators and hardware performance analyzers fail to attribute the full count of memory references and cache misses to particular modules, because this requires analysis of the program call graph. We extend the cache simulator of the Valgrind tool so that it accounts for memory accesses per software module and renders a realistic distribution of hot spots in a program. In addition, analysis of address sequences in the simulator makes it possible to recover array access patterns and propose effective prefetching schemes. Motivating examples are provided to illustrate the use of the Valgrind tool.
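The two analyses named in this abstract, per-module accounting of cache misses and recovery of array access patterns from address sequences, can be sketched on a toy trace. This is not the Valgrind extension itself: the `(module, address)` trace format, the direct-mapped cache geometry and the 90% threshold are assumptions chosen to keep the example small.

```python
from collections import Counter

LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_SETS = 256    # direct-mapped cache, one line per set (assumed)


def simulate(trace):
    """Replay (module, address) pairs through a direct-mapped cache.

    Returns two Counters keyed by module name: accesses and misses.
    """
    resident = {}                      # set index -> line number held there
    accesses, misses = Counter(), Counter()
    for module, addr in trace:
        line = addr // LINE_SIZE
        index = line % NUM_SETS
        accesses[module] += 1
        if resident.get(index) != line:
            misses[module] += 1        # cold or conflict miss
            resident[index] = line     # the new line replaces the old one
    return accesses, misses


def dominant_stride(addresses, min_share=0.9):
    """Return the byte stride that covers at least min_share of the
    consecutive address differences, or None if no stride dominates."""
    deltas = Counter(b - a for a, b in zip(addresses, addresses[1:]))
    if not deltas:
        return None
    stride, count = deltas.most_common(1)[0]
    return stride if count / (len(addresses) - 1) >= min_share else None
```

On a trace where one module walks an array with an 8-byte stride and another with a 4096-byte stride, `simulate` attributes far more misses to the second module, and `dominant_stride` recovers each stride. A recovered constant stride is the kind of pattern for which a prefetch a fixed number of elements ahead is a natural candidate.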