Optimizing HPC Applications with Intel® Cluster Tools

Dahnken, Christopher; Semin, Andrey; Supalov, Alexander; Klemm, Michael

doi:10.1007/978-1-4302-6497-2

Cited by 17 publications

(12 citation statements)

References 3 publications

“…The first one is strictly sequential and performs $x_j \leftarrow a x_{j-1} + c$, while the second one performs $v_j \leftarrow x_j / m$ and can be vectorized using new SIMD extensions like AVX and AVX-512 which are available in modern multicore and manycore processors [9,27]. It can be enforced by placing the pragma simd before each loop [27]. To optimize memory access the array, v should be allocated using the _mm_malloc() intrinsic.…”

Section: Performance Analysis (mentioning)

confidence: 99%
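The two loops described in the statement above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the cited paper's code: the function names `lcg_fill` and `lcg_scale` are hypothetical, the 32-bit state with an implied modulus of 2^32 is an assumption, and the portable OpenMP directive `#pragma omp simd` stands in for the Intel-specific `pragma simd` named in the quote.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop 1 (hypothetical helper): strictly sequential recurrence
 * x[j] = a * x[j-1] + c. Each element depends on the previous one,
 * so this loop cannot be vectorized as written. Arithmetic on
 * uint32_t wraps modulo 2^32 (an assumption of this sketch). */
void lcg_fill(uint32_t *x, size_t n, uint32_t seed, uint32_t a, uint32_t c)
{
    if (n == 0)
        return;
    x[0] = seed;
    for (size_t j = 1; j < n; ++j)
        x[j] = a * x[j - 1] + c;
}

/* Loop 2 (hypothetical helper): v[j] = x[j] / m. There is no
 * loop-carried dependence, so the compiler may vectorize it.
 * The pragma is ignored if OpenMP SIMD support is not enabled. */
void lcg_scale(const uint32_t *x, double *v, size_t n, double m)
{
    assert(m != 0.0);
    #pragma omp simd
    for (size_t j = 0; j < n; ++j)
        v[j] = (double)x[j] / m;
}
```

Calling `lcg_fill(x, n, 1u, 1664525u, 1013904223u)` and then `lcg_scale(x, v, n, 4294967296.0)` yields values in [0, 1); the multiplier and increment here are illustrative constants, not ones taken from the cited paper. With GCC, the directive is honored when compiling with `-fopenmp-simd`.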

Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures

Stpiczyński

2017

J Supercomput

The aim of this paper is to show that the multidimensional Monte Carlo integration can be efficiently implemented on computers with modern multicore CPUs and manycore accelerators including Intel MIC and GPU architectures using a new vectorized version of the LCG pseudorandom number generator, which requires a limited amount of memory. We introduce two new implementations of the algorithm based on directive-based parallel programming standards OpenMP and OpenACC and consider their performance using the Hockney-Jesshope theoretical model of vector computations. We also present and discuss the results of experiments performed on dual-processor Intel Xeon E5-2670 computers with Intel Xeon Phi 7120P and NVIDIA K40m.

“…To optimize memory access the array, v should be allocated using the _mm_malloc() intrinsic. It works just like the malloc function and additionally allows data alignment [27]. This loop has limited length (i.…”

Section: Performance Analysis (mentioning)

confidence: 99%
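The aligned allocation mentioned in the statement above can be illustrated with a short C sketch. The quote names `_mm_malloc()`, which takes a size and an alignment and whose memory is released with `_mm_free()`. The sketch below swaps in the standard C11 `aligned_alloc()` so that it does not depend on x86 intrinsic headers. The helper names `alloc_aligned_doubles` and `is_aligned` are hypothetical, and the 64-byte alignment used in the usage note is an assumption chosen to match a typical cache line and AVX-512 vector width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles on an `alignment`-byte
 * boundary. `alignment` must be a power of two. C11 aligned_alloc
 * expects the size to be a multiple of the alignment, so the byte
 * count is rounded up. Release the result with free(). */
double *alloc_aligned_doubles(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0)
        rounded = alignment;
    return aligned_alloc(alignment, rounded);
}

/* Hypothetical helper: returns nonzero if p lies on an
 * `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

For example, `double *v = alloc_aligned_doubles(1000, 64);` returns a buffer for which `is_aligned(v, 64)` holds, and the buffer is later released with `free(v)`. With the intrinsic from the quote, the corresponding calls would be `_mm_malloc(1000 * sizeof(double), 64)` and `_mm_free(v)`.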

Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures

Stpiczyński

2017

J Supercomput

“…Recently, multicore and manycore computer architectures have become very attractive for achieving high-performance execution of scientific applications at relatively low costs [5,13,17]. Modern CPUs and accelerators achieve performance that was recently reached by supercomputers.…”

Section: Introduction (mentioning)

confidence: 99%

“…Intel C/C++ compilers and development tools offer many language-based extensions that can be used to simplify the process of developing high-performance parallel programs [6,17]. OpenMP [3,18] is the most popular, but one can consider using Threading Building Blocks (TBB for short) [6,12] or Cilk Plus [5,13].…”

Section: Introduction (mentioning)

confidence: 99%

Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus

Stpiczyński

2018

J Supercomput

The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform a source code of Adaptive Simpson's Integration to programs that can utilize multiple cores of modern processors. Using the example of the Bellman-Ford algorithm for solving single-source shortest path problems, we advise how to improve performance of data parallel algorithms by tuning data structures for better utilization of vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.

Memory allocation anomalies in high‐performance computing applications: A study with numerical simulations

Gomes, Molion, Souto, et al.

2020

Concurrency and Computation

Summary: A memory allocation anomaly occurs when the allocation of a set of heap blocks imposes an unnecessary overhead on the execution of an application. This overhead is particularly disturbing for high‐performance computing (HPC) applications running on shared resources—for example, numerical simulations running on clusters or clouds—because it may increase either the execution time of the application (contributing to a reduction in the overall efficiency of the shared resource) or its memory consumption (eventually inhibiting its capacity to handle larger problems). In this article, we propose a method for identifying, locating, characterizing and fixing allocation anomalies, and a tool for developers to apply the method. We evaluate our method and tool on a numerical simulator aimed at approximating the solutions to partial differential equations using a finite element method. We show that taming allocation anomalies in this simulator reduces both its execution time and the memory footprint of its processes, irrespective of the specific heap allocator being employed with it. We conclude that the developer of HPC applications can benefit from the method and tool during the software development cycle.
