2014
DOI: 10.1007/978-1-4302-6497-2

Optimizing HPC Applications with Intel® Cluster Tools

Abstract: Here and elsewhere, certain product names may be the property of their respective third parties. Use Interprocedural Optimization: Add the compiler flag -ipo to switch on interprocedural optimization. This will give the compiler a holistic view of the program and open more optimization opportunities for the program as a whole. Note that this will also increase the overall compilation time. Runtime profiling can also increase the chances for the compiler to generate better code. Profile-guided o…

Cited by 17 publications (12 citation statements)
References 3 publications

“…The first one is strictly sequential and performs $x_j \leftarrow a x_{j-1} + c$, while the second one performs $v_j \leftarrow x_j / m$ and can be vectorized using new SIMD extensions like AVX and AVX-512, which are available in modern multicore and manycore processors [9,27]. It can be enforced by placing the pragma simd before each loop [27]. To optimize memory access, the array v should be allocated using the _mm_malloc() intrinsic.…”
Section: Performance Analysis (mentioning)
confidence: 99%
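
The two loops described in this statement can be sketched roughly as follows. This is a minimal illustration, not the citing paper's code: the function name `lcg_fill`, its signature, and the explicit reduction modulo m are assumptions (the quoted recurrence omits the modulus), and the OpenMP `#pragma omp simd` is used as a portable stand-in for the Intel-specific `#pragma simd` named in the quote.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills x[0..n-1] with generator states and
 * v[0..n-1] with the scaled values x[j] / m.
 * Assumption: the recurrence is reduced modulo m, and a, c, m are
 * small enough that a * prev + c does not overflow 64 bits. */
void lcg_fill(double *v, uint64_t *x, size_t n,
              uint64_t seed, uint64_t a, uint64_t c, uint64_t m)
{
    assert(v != NULL && x != NULL && m != 0);

    /* Loop 1: strictly sequential, because x[j] depends on x[j-1]. */
    uint64_t prev = seed;
    for (size_t j = 0; j < n; ++j) {
        prev = (a * prev + c) % m;
        x[j] = prev;
    }

    /* Loop 2: iterations are independent, so the loop may be vectorized.
     * The pragma is a request; it is ignored if OpenMP SIMD is disabled. */
    #pragma omp simd
    for (size_t j = 0; j < n; ++j) {
        v[j] = (double)x[j] / (double)m;
    }
}
```

With the illustrative parameters seed = 7, a = 5, c = 3, m = 16, the first four states are 6, 1, 8, 11, so v starts with 0.375 and 0.0625. Splitting the work into two loops keeps the dependency chain out of the loop that the compiler is asked to vectorize.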
“…To optimize memory access, the array v should be allocated using the _mm_malloc() intrinsic. It works just like the malloc function and additionally allows data alignment [27]. This loop has limited length (i.…”
Section: Performance Analysis (mentioning)
confidence: 99%
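
A small sketch of the aligned allocation mentioned in this statement is given below. The helper names `alloc_aligned_doubles` and `is_aligned` and the 64-byte alignment are assumptions made for illustration; the quote does not state which alignment the citing paper uses. The code assumes an x86 compiler whose `<immintrin.h>` declares `_mm_malloc` and `_mm_free`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* declares _mm_malloc and _mm_free on x86 compilers */

/* Assumed alignment: 64 bytes, the width of one AVX-512 register. */
#define VEC_ALIGN 64

/* Hypothetical helper: allocates n doubles on a VEC_ALIGN boundary.
 * Returns NULL on failure; release the memory with _mm_free, not free. */
double *alloc_aligned_doubles(size_t n)
{
    return (double *)_mm_malloc(n * sizeof(double), VEC_ALIGN);
}

/* Hypothetical helper: returns 1 if p is a multiple of alignment, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

A caller would obtain the array with `alloc_aligned_doubles(n)`, use it like any `malloc`-ed buffer, and release it with `_mm_free`. Memory from `_mm_malloc` must not be passed to `free`.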
“…Recently, multicore and manycore computer architectures have become very attractive for achieving high-performance execution of scientific applications at relatively low costs [5,13,17]. Modern CPUs and accelerators achieve performance that was recently reached by supercomputers.…”
Section: Introduction (mentioning)
confidence: 99%
“…Intel C/C++ compilers and development tools offer many language-based extensions that can be used to simplify the process of developing high-performance parallel programs [6,17]. OpenMP [3,18] is the most popular, but one can consider using Threading Building Blocks (TBB for short) [6,12] or Cilk Plus [5,13].…”
Section: Introduction (mentioning)
confidence: 99%