Juergen Lorenz scite author profile

First-principles simulations of high-Z metallic systems using the Qbox code on the BlueGene/L supercomputer demonstrate unprecedented performance and scaling for a quantum simulation code. Specifically designed to take advantage of massively parallel systems like BlueGene/L, Qbox demonstrates excellent parallel efficiency and peak performance. A sustained peak performance of 207.3 TFlop/s was measured on 65,536 nodes, corresponding to 56.5% of the theoretical full machine peak using all 128k CPUs.
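The 56.5% figure can be checked with a few lines of arithmetic. The sketch below assumes a 700 MHz clock and 4 floating-point operations per cycle per CPU for BlueGene/L; neither value is stated in the abstract, and the function names are illustrative only.

```python
# Assumptions (not stated in the abstract): 700 MHz clock and 4 flops per
# cycle per CPU (two-way FPU with fused multiply-add).
CLOCK_HZ = 700e6
FLOPS_PER_CYCLE = 4
CPUS = 2 * 65_536  # two CPUs per node, i.e. the "128k CPUs" of the abstract


def peak_tflops(cpus: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFlop/s for the whole machine."""
    return cpus * clock_hz * flops_per_cycle / 1e12


def fraction_of_peak(sustained_tflops: float, peak: float) -> float:
    """Sustained performance as a fraction of theoretical peak."""
    return sustained_tflops / peak


peak = peak_tflops(CPUS, CLOCK_HZ, FLOPS_PER_CYCLE)
frac = fraction_of_peak(207.3, peak)
print(f"peak = {peak:.1f} TFlop/s, sustained = {100 * frac:.1f}% of peak")
```

Under these assumptions the computed peak is about 367 TFlop/s, which is consistent with the percentage quoted in the abstract.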

Large-Scale First-Principles Molecular Dynamics simulations on the BlueGene/L Platform using the Qbox code

Gygi, Yates, Lorenz et al.

We demonstrate that the Qbox code supports unprecedented large-scale First-Principles Molecular Dynamics (FPMD) applications on the BlueGene/L supercomputer. Qbox is an FPMD implementation specifically designed for large-scale parallel platforms such as BlueGene/L. Strong scaling tests for a Materials Science application show an 86% scaling efficiency between 1024 and 32,768 CPUs. Measurements of performance by means of hardware counters show that 37% of the peak FPU performance can be attained.
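As a rough illustration of what an 86% strong-scaling efficiency means, the sketch below uses the usual definition (measured speedup divided by ideal speedup). The function names are hypothetical, and the abstract does not give run times, so only the implied speedup is derived.

```python
def strong_scaling_efficiency(t_base: float, p_base: int,
                              t_large: float, p_large: int) -> float:
    """Measured speedup divided by the ideal speedup p_large / p_base."""
    speedup = t_base / t_large
    ideal = p_large / p_base
    return speedup / ideal


def implied_speedup(efficiency: float, p_base: int, p_large: int) -> float:
    """Speedup implied by a given efficiency between two CPU counts."""
    return efficiency * p_large / p_base


# 86% efficiency between 1024 and 32,768 CPUs (ideal speedup: 32x)
print(implied_speedup(0.86, 1024, 32_768))
```

With these numbers the implied speedup is roughly 27.5 out of an ideal 32.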

Efficient Utilization of SIMD Extensions

et al. 2005

This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.

SIMD Vectorization of Straight Line FFT Code

Král, Franchetti, Lorenz et al. 2003

This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. FFT kernels are accelerated by automatically vectorizing blocks of straight line code for processors featuring two-way short vector SIMD extensions like AMD's 3DNow! and Intel's SSE 2. Additionally, a special compiler backend is introduced which is able to (i) utilize particular code properties, (ii) generate optimized address computation, and (iii) apply specialized register allocation and instruction scheduling. Experiments show that automatic SIMD vectorization can achieve performance that is comparable to the optimal hand-generated code for FFT kernels. The newly developed methods have been integrated into the codelet generator of FFTW and successfully vectorized complicated code like real-to-halfcomplex non-power-of-two FFT kernels. The floating-point performance of FFTW's scalar version has been more than doubled, resulting in the fastest FFT implementation to date.
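The basic idea of two-way vectorization of straight-line code can be sketched by emulating a two-lane SIMD register in Python. This is only an illustration of pairing independent scalar operations into vector operations on a twiddle-free radix-2 butterfly; it is not the vectorizer described in the paper, and all names (`Vec2`, `vadd`, `vsub`, the butterfly functions) are hypothetical.

```python
from typing import Tuple

# Emulated two-way SIMD register: (lane0, lane1). Here one register holds
# the real and imaginary parts of a single complex number.
Vec2 = Tuple[float, float]


def vadd(x: Vec2, y: Vec2) -> Vec2:
    """Emulated two-way SIMD add: one instruction, two lanes."""
    return (x[0] + y[0], x[1] + y[1])


def vsub(x: Vec2, y: Vec2) -> Vec2:
    """Emulated two-way SIMD subtract: one instruction, two lanes."""
    return (x[0] - y[0], x[1] - y[1])


def butterfly_scalar(ar: float, ai: float, br: float, bi: float):
    """Straight-line scalar code: four independent floating-point operations."""
    cr = ar + br
    ci = ai + bi
    dr = ar - br
    di = ai - bi
    return cr, ci, dr, di


def butterfly_vec2(a: Vec2, b: Vec2):
    """The same butterfly after pairing scalar operations: two vector operations."""
    c = vadd(a, b)  # pairs (cr, ci)
    d = vsub(a, b)  # pairs (dr, di)
    return c, d


print(butterfly_scalar(1.0, 2.0, 3.0, 5.0))
print(butterfly_vec2((1.0, 2.0), (3.0, 5.0)))
```

Both versions compute the same values; the vector form issues half as many arithmetic instructions, which is where the ideal factor of two for two-way extensions comes from.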

Large-Scale First-Principles Molecular Dynamics Simulations on the BlueGene/L Platform using the Qbox Code

Gygi, Draeger, Supinski et al. 2006

Vectorization techniques for the Blue Gene/L double FPU

Lorenz, Král, Franchetti et al. 2005

IBM J. Res. & Dev.

This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer. The innovative technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and joined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). FFTW performance-optimization facilities working with the compiler technologies presented in this paper are able to produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.

Comparison of results of different methods for the analysis of flux creep behavior in a melt-textured YBa2Cu3

Reissner, Lorenz 1997

Phys. Rev. B

FFT Compiler Techniques

Král, Franchetti, Lorenz et al. 2004

This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. Numerical applications are accelerated by automatically vectorizing blocks of straight line code to be run on processors featuring two-way short vector SIMD extensions like Intel's SSE 2 on Pentium 4, SSE 3 on Intel Prescott, AMD's 3DNow!, and IBM's SIMD operations implemented on the new processors of the BlueGene/L supercomputer. The paper introduces a special compiler backend for Intel P4's SSE 2 and AMD's 3DNow! which is able (i) to exploit particular properties of FFT code, (ii) to generate optimized address computation, and (iii) to perform specialized register allocation and instruction scheduling. Experiments show that the automatic SIMD vectorization techniques of this paper achieve the performance of hand-optimized code for key benchmarks. The newly developed methods have been integrated into the codelet generator of FFTW and successfully vectorized complicated code like real-to-halfcomplex non-power-of-two FFT kernels. The floating-point performance of FFTW's scalar version has been more than doubled, resulting in the fastest FFT implementation to date.

Juergen Lorenz

Gordon Bell finalists I---Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform
