Rakesh Krishnaiyer scite author profile

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the " Ninja gap ", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X ) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap if left unaddressed will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X . These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC) which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.

show abstract

A Library-Based Approach to Task Parallelism in a Data-Parallel Language

Foster

Kohr

Krishnaiyer

et al. 1997

Journal of Parallel and Distributed Computing

View full text Add to dashboard Cite

The data-parallel language High Performance Fortran (HPF) does not allow e cient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how these common parallel program structures can be represented, with only minor extensions to the HPF model, by using a coordination library based on the Message Passing Interface (MPI). This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology.

show abstract

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor

Krishnaiyer

Kültürsay

Chawla

et al. 2013

View full text Add to dashboard Cite

Optimizing software data prefetches with rotating registers

Doshi¹,

Krishnaiyer²,

Muthukumar³

View full text Add to dashboard Cite

show abstract

Value-Profile Guided Stride Prefetching for Irregular Code

Serrano

Krishnaiyer

et al. 2002

View full text Add to dashboard Cite

Memory operations in irregular code are difficult to prefetch, as the future address of a memory location is hard to anticipate by a compiler. However, recent studies as well as our experience indicate that many irregular programs contain loads with near-constant strides. This paper presents a novel compiler technique to profile and prefetch for those loads. The profile captures not only the dominant stride values for each profiled load, but also the differences between the successive strides of the load. The profile information helps the compiler to classify load instructions into strongly or weakly strided and singlestrided or phased multi-strided. The prefetching decisions guided by the load classifications are highly selective and beneficial. We obtain significant performance improvement for the CPU2000 integer programs running on Itanium machines. For example, we achieve a 1.55x speedup for "181.mcf", 1.15x for "254.gap", 1.08x for "197.parser" and smaller gains in other benchmarks. We also show that the performance gain is stable across profile data sets and that the profiling overhead is low. These benefits make the new technique suitable for a production compiler.

show abstract

Integrating High-Level Optimizations in a Production Compiler: Design and Implementation Experience

Ghosh

Kanhere

Krishnaiyer

et al. 2003

View full text Add to dashboard Cite

The High-Level Optimizer (HLO) is a key part of the compiler technology that enabled Itanium TM and Itanium TM 2 processors deliver leading floating-point performance at their introduction. In this paper, we discuss the design and implementation experience in integrating diverse optimizations in the HLO module. In particular, we describe decisions made in the design of HLO targeting Itanium processor family. We provide empirical data to validate the design decisions. Since HLO was implemented in a production compiler, we made certain engineering trade-offs. We discuss these trade-offs and outline key learning derived from our experience.

show abstract

Remote I/O

Foster

Kohr

Krishnaiyer

et al. 1997

View full text Add to dashboard Cite

show abstract

Can traditional programming bridge the ninja performance gap for parallel computing applications?

et al. 2015

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.