The UPC programming language offers parallelism via a logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically carry out between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. This programmer-friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread communication can be irregular and fine-grained. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtain considerable performance improvements for UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of the various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions for similar parallel implementations, not limited to UPC, that involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
Motivation

Good programmer productivity and high computational performance are usually two conflicting goals in the context of developing parallel code for scientific computations. Partitioned global address space (PGAS) [10,2,11,22], however, is a parallel programming model that aims to achieve both goals at the same time. The fundamental mechanism of PGAS is a global address space that is conceptually shared among the concurrent processes that jointly execute a parallel program. Data exchange between the processes is carried out by a low-level network layer "under the hood", without explicit involvement from the programmer, thus providing good productivity. The shared global address space is logically partitioned such that each partition has affinity to a designated owner process. This awareness of data locality is essential for achieving good performance of parallel programs written in the PGAS model, because the globally shared address space may actually encompass many physically distributed memory sub-systems.

Unified Parallel C (UPC) [13,28] is an extension of the C language that provides the PGAS parallel programming model. The concurrent execution processes of UPC are termed threads, which execute a UPC program in the style of single-program-multiple-...
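As a minimal illustration of the implicit communication described above, the following UPC sketch (the names N, idx, and gather_sum are our own illustrative assumptions, not from the source) declares a block-distributed shared array and indexes it indirectly; reads of elements with affinity to another thread silently become fine-grained remote transfers:

```c
/* UPC sketch (requires a UPC compiler, e.g. Berkeley UPC's upcc).
 * Each thread owns one contiguous block of x[] and idx[]; any thread
 * may nevertheless read any element. When idx[i] points into another
 * thread's block, the runtime performs a one-element remote read --
 * exactly the fine-grained irregular communication discussed here. */
#include <upc.h>

#define N 1024                      /* illustrative array size */
shared [N/THREADS] double x[N];     /* block-wise affinity: thread t owns block t */
shared [N/THREADS] int    idx[N];   /* irregular (indirect) index array */

double gather_sum(void) {
    double s = 0.0;
    /* affinity expression &idx[i]: each iteration runs on the thread
       that owns idx[i], so idx[i] itself is a local read ... */
    upc_forall (int i = 0; i < N; i++; &idx[i]) {
        s += x[idx[i]];   /* ... but x[idx[i]] may be remote: implicit,
                             fine-grained between-thread communication */
    }
    return s;
}
```

The convenience is that gather_sum contains no explicit communication calls; the cost, as the paper argues, is that each remote x[idx[i]] access may translate into a separate small network message.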