Abstract-Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we consider programmer productivity, performance, and the ease of applying optimizations.
Abstract-The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
Current and planned computer systems present challenges for scientific programming. Memory capacity and bandwidth are limiting performance as floating-point capability increases due to more cores per processor and wider vector units. Effectively using the hardware requires finding greater parallelism in programs while using relatively less memory. In this poster, we present how we tuned the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics proxy application for on-node performance, resulting in 62% fewer memory reads, a 19% smaller memory footprint, 770% more floating-point operations being vectorized, and a serial section of less than 0.1% of the runtime. Tests show runtime decreases of up to 57% for serial code versions and parallel runtime reductions of up to 75%. We are also applying these optimizations to GPUs and to a subset of ALE3D, the application from which the proxy application was derived. So far we achieve up to a 1.9x speedup on GPUs and a 13% runtime reduction in the application for the same problem.
I. INTRODUCTION

Hydrodynamics is widely used to model continuum material properties and material interactions in the presence of applied forces, and it can consume up to one third of the runtime of the applications that use it. To provide a simpler, but still full-featured, problem for testing tuning techniques and different programming models, the Livermore Unstructured Lagrange Explicit Shock Hydro (LULESH) mini-app was created as one of five challenge problems in the DARPA UHPC program [1]. LULESH solves the Sedov problem by modeling one octant of a symmetrical blast wave.

We are using LULESH to test optimization techniques and programming practices that increase the performance of code on current and future architectures. By using a mini-app, we can quickly explore and evaluate techniques that hold promise before making the more extensive changes needed in production codes. We focus on increasing utilization of hardware parallelism, reducing memory traffic, and decreasing memory footprint. Optimizations target on-node memory bandwidth, memory footprint, and parallelism, because architectural trends are resulting in machines with less memory per core, less relative bandwidth, and more on-node parallelism.

We applied six optimizations to LULESH: loop fusion, array contraction, data layout changes, increased vectorization, NUMA-aware allocation, and allocation of temporaries outside the timestep loop. These changes reduced last-level cache misses by over 62%, shrank the global state size of the program by 19%, and cut the serial section to less than 0.1% of the overall runtime.
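To illustrate two of these optimizations, the sketch below shows loop fusion combined with array contraction on a simplified per-element update. The kernels `pressure` and `soundSpeed` are hypothetical placeholders standing in for LULESH's actual physics routines, and the function names are illustrative, not from the source code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for per-element physics kernels.
static double pressure(double e)   { return 0.4 * e; }
static double soundSpeed(double p) { return 1.0 + 0.5 * p; }

// Before: two separate loops, with a full-length temporary array holding
// the intermediate pressures, allocated on every call.
void updateUnfused(const std::vector<double>& e, std::vector<double>& q) {
  std::vector<double> p(e.size());            // temporary state array
  for (std::size_t i = 0; i < e.size(); ++i)
    p[i] = pressure(e[i]);
  for (std::size_t i = 0; i < e.size(); ++i)
    q[i] = soundSpeed(p[i]);
}

// After: the loops are fused and the temporary array is contracted to a
// scalar, so the intermediate value never round-trips through memory and
// each element is read once.
void updateFused(const std::vector<double>& e, std::vector<double>& q) {
  for (std::size_t i = 0; i < e.size(); ++i) {
    double p = pressure(e[i]);                // contracted temporary
    q[i] = soundSpeed(p);
  }
}
```

The fused form eliminates one full array's worth of memory traffic and footprint per call; hoisting any remaining temporaries outside the timestep loop removes the repeated allocation as well.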