Although general-purpose computation on graphics processing units (GPGPU) is widely used for high-performance computing, standard programming frameworks such as CUDA and OpenCL are still difficult to use. They require low-level specifications, and hand-optimization is a large burden. We are therefore developing an easier framework named MESI-CUDA. Based on a virtual shared memory model, MESI-CUDA hides low-level memory management and data transfer from the user. The compiler generates the low-level code and also optimizes memory accesses by applying conventional hand-optimization techniques. However, GPU threads are created in the same way as in CUDA: the user specifies the thread mapping, i.e., the thread indexing and the size of the thread blocks that run on each streaming multiprocessor (SM). The mapping strongly affects execution performance and may obstruct the automatic optimization performed by the MESI-CUDA compiler, so the user must find an optimal specification that takes physical parameters into account. In this paper, we propose a new thread mapping scheme. We introduce a new thread creation syntax that specifies a hardware-independent logical mapping, which is converted into an optimized physical mapping at compile time. By statically analyzing array index expressions, we obtain groups of threads that access the same or neighboring array elements. By mapping such threads to the same thread block and assigning them consecutive thread indices, the physical mapping is determined so as to maximize the effect of memory access optimization. In our evaluation, the scheme found optimal mapping strategies for five benchmark programs. Memory access transactions were reduced to approximately 1/4, and a speedup of 1.4 to 76 times was achieved compared with the worst mapping.
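To illustrate why the thread mapping matters for memory access optimization, the following is a minimal Python sketch that counts how many memory segments one warp touches under two different mappings. It is not the MESI-CUDA analysis itself. The warp size of 32, the 128-byte transaction size, the 4-byte element size, the array width, and all function names are assumptions chosen for illustration.

```python
WARP_SIZE = 32        # assumed number of threads issuing a request together
SEGMENT_BYTES = 128   # assumed size of one memory transaction
ELEM_BYTES = 4        # assumed element size (e.g. a 32-bit float)
WIDTH = 1024          # row length of a hypothetical row-major 2D array


def transactions_per_warp(index_of_thread, warp_start=0):
    """Count the distinct memory segments touched by one warp.

    index_of_thread maps a thread index to the flat array index it accesses.
    """
    segments = set()
    for tid in range(warp_start, warp_start + WARP_SIZE):
        byte_addr = index_of_thread(tid) * ELEM_BYTES
        segments.add(byte_addr // SEGMENT_BYTES)
    return len(segments)


def consecutive(tid):
    # Consecutive threads access neighboring elements of one row.
    return tid


def strided(tid):
    # Consecutive threads access the same column of different rows.
    return tid * WIDTH


coalesced = transactions_per_warp(consecutive)
scattered = transactions_per_warp(strided)
print(coalesced, scattered)
```

Under these assumptions, the mapping that gives consecutive indices to threads accessing neighboring elements needs far fewer transactions per warp, which is the effect the proposed physical mapping aims to maximize.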
Although the graphics processing unit (GPU) is expected to be a practical high-performance computing platform, current programming frameworks such as CUDA and OpenCL incur a high programming cost. We are therefore developing a new framework, MESI-CUDA, which provides shared variables to hide low-level data management in CUDA. However, handling dynamic data structures is difficult in the current MESI-CUDA because shared variables cannot be created dynamically and pointer fields are not allowed in them. We therefore extended MESI-CUDA to remove these restrictions. By introducing dynamic management of shared variables and automatic pointer conversion on data transfer, any pointer-based dynamic data structure can be shared between the CPU and GPU with only small changes to the C code. In our evaluation, pointer conversion made the transfer time of data structures approximately 3.3 times longer in the worst case and 1.3 to 2 times longer in practical cases. Considering that non-conversion alternatives incur overhead on pointer dereferences, we regard this overhead as practical in most cases.

KEY WORDS: parallel programming language, compiler, GPU, CUDA
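The idea of converting pointers during data transfer can be sketched in Python as follows. This is only one possible scheme, assumed here for illustration: the nodes of a linked list are taken to occupy a single contiguous region that is copied as a block, so each pointer can be rebased by a constant offset. The abstract does not state that MESI-CUDA works this way, and the `Node` layout, `NODE_BYTES`, the base addresses, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

NULL = 0          # null pointer value in both address spaces
NODE_BYTES = 16   # assumed size of one node in memory


@dataclass
class Node:
    value: int
    next_addr: int  # raw address of the next node, or NULL


def convert_pointers(nodes: List[Node], host_base: int, device_base: int) -> List[Node]:
    """Return copies of the nodes with next_addr rebased to device addresses."""
    delta = device_base - host_base
    converted = []
    for node in nodes:
        new_next = NULL if node.next_addr == NULL else node.next_addr + delta
        converted.append(Node(node.value, new_next))
    return converted


def traverse(nodes: List[Node], base: int) -> List[int]:
    """Follow next_addr links starting from the node stored at base."""
    values = []
    addr = base
    while addr != NULL:
        node = nodes[(addr - base) // NODE_BYTES]
        values.append(node.value)
        addr = node.next_addr
    return values


host_base = 0x1000            # hypothetical host address of the region
device_base = 0x7000_0000     # hypothetical device address of the copy

host_list = [
    Node(10, host_base + 1 * NODE_BYTES),
    Node(20, host_base + 2 * NODE_BYTES),
    Node(30, NULL),
]
device_list = convert_pointers(host_list, host_base, device_base)
print(traverse(device_list, device_base))
```

In this sketch the conversion cost is paid once per transfer, whereas an alternative that keeps host addresses or offsets would pay a translation cost on every dereference, which is the trade-off the abstract refers to.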