Compiling for an indirect vector register architecture

Nuzman, Dorit; Namolaru, Mircea; Zaks, Ayal; Derby, J.

doi:10.1145/1366230.1366266

Cited by 5 publications

(2 citation statements)

References 20 publications

“…The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14]. In addition, several other works have addressed the exploitation of SIMD instruction sets [22,24,23,30,32,31,28]. All of these works only address SIMD hardware alignment issues.…”

Section: Related Work (mentioning)

confidence: 99%

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty, Stock, Pouchet, et al. 2011

Lecture Notes in Computer Science

Abstract. Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.

“…The high-level data-reuse carried by the outer-loops in these loop nests can be detected and exploited only if operating at the level of the outer-loop. For this reason we have implemented an in-place vectorization approach that directly vectorizes the outer-loop [31][32][33][34][35][36], instead of the traditional approach of interchanging an outer-loop with the inner-most loop, followed by vectorizing it at the inner-most position [28]. The cost model we developed is capable of guiding the compiler which of these two alternatives is expected to be more profitable (as explained in the following Section).…”

Section: Transformation Name (mentioning)

confidence: 99%

ACOTES Project: Advanced Compiler Technologies for Embedded Streaming

Munk, Ayguadé, Bastoul, et al. 2010

Int J Parallel Prog

Self Cite

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

Automatic efficient data layout for multithreaded stencil codes on CPUs and GPUs

Jaeger, Barthou 2012

2012 19th International Conference on High Performance Computing
