Daniel Sunderland scite author profile

h i g h l i g h t s• We developed a performance portable programming model (PM) for manycore devices.• Unifying parallel dispatch and data layout is mandatory for performance portability.• The Kokkos C++library implements this PM with pthreads, OpenMP, and CUDA back-ends. • Demonstrate Xeon Phi and NVIDIA GPU performance portability with mini-applications.• Recommend a strategy for legacy application codes to migrate to manycore. a b s t r a c tThe manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos' abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

show abstract

Manycore Performance-Portability: Kokkos Multidimensional Array Library

Edwards

Sunderland

Porter

et al. 2012

Scientific Programming

View full text Add to dashboard Cite

Large, complex scientific and engineering application code have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. Optimal data access pattern can be different for different manycore devices – potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introduce device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].

show abstract

Radiative Heat Transfer Calculation on 16384 GPUs Using a Reverse Monte Carlo Ray Tracing Approach with Adaptive Mesh Refinement

Humphrey

Sunderland

Harman

et al. 2016

View full text Add to dashboard Cite

HOMMEXX 1.0: a performance-portable atmospheric dynamical core for the Energy Exascale Earth System Model

et al. 2019

View full text Add to dashboard Cite

Abstract. We present an architecture-portable and performant implementation of the atmospheric dynamical core (High-Order Methods Modeling Environment, HOMME) of the Energy Exascale Earth System Model (E3SM). The original Fortran implementation is highly performant and scalable on conventional architectures using the Message Passing Interface (MPI) and Open MultiProcessor (OpenMP) programming models. We rewrite the model in C++ and use the Kokkos library to express on-node parallelism in a largely architecture-independent implementation. Kokkos provides an abstraction of a compute node or device, layout-polymorphic multidimensional arrays, and parallel execution constructs. The new implementation achieves the same or better performance on conventional multicore computers and is portable to GPUs. We present performance data for the original and new implementations on multiple platforms, on up to 5400 compute nodes, and study several aspects of the single- and multi-node performance characteristics of the new implementation on conventional CPU (e.g., Intel Xeon), many core CPU (e.g., Intel Xeon Phi Knights Landing), and Nvidia V100 GPU.

show abstract

Kokkos Array performance-portable manycore programming model

Edwards

Sunderland

2012

View full text Add to dashboard Cite

Large, complex scientific and engineering application code have a significant investment in computational kernels which implement their mathematical models. Porting these computational kernels to multicore-CPU and manycore-accelerator (e.g., NVIDIA R GPU) devices is a major challenge given the diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides library-based approach for implementing computational kernels that are performance-portable to multicore-CPU and manycore-accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel computational kernels, and (3) multidimensional arrays. Performance-portability is achieved by decoupling computational kernels from device-specific data access performance requirements (e.g., NVIDIA coalesced memory access) through an intuitive multidimensional array API. The Kokkos Array API uses C++ template meta-programming to, at compile time, transparently insert device-optimal data access maps into computational kernels. With this programming model computational kernels can be written once and, without modification, performance-portably compiled to multicore-CPU and manycore-accelerator devices.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.