A Tile Size Selection Analysis for Blocked Array Layouts

Athanasaki, Evangelia; Koziris, Nectarios; Tsanakas, Panayiotis

doi:10.1109/interact.2005.1

Cited by 4 publications

(3 citation statements)

References 28 publications

“…Taking into account the miss penalty of each memory level, as well as the penalty of mispredicted branches (as presented in [2]), we derive the total miss cost of Table 2. […] Figure 4 makes clear that L1 misses dominate cache and, as a result, total performance in the Xeon DP architecture.…”

Section: Total Miss Cost (mentioning)

confidence: 99%

Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT Architectures

Athanasaki

Kourtis

Anastopoulos

et al. 2005

Advances in Informatics

Abstract. Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, has been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost cache performance. The stability of the performance improvements achieved is heavily dependent on the appropriate selection of tile sizes. In this paper, we investigate the memory performance of blocked data layouts and provide a theoretical analysis for the multiple levels of the memory hierarchy when they are organized in a set-associative fashion. According to this analysis, the optimal tile size, the one that maximizes L1 cache utilization, should completely fit in the L1 cache, even for loop bodies that access more than one array. Increased self- and/or cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the CPU cycles lost to them. Results are validated through actual benchmarks on an SMT platform.
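As a rough illustration of the sizing rule stated in this abstract, the sketch below computes the largest square tile whose elements occupy no more than a given L1 data cache capacity. The function name `largest_square_tile`, the 16 KiB capacity, and the 8-byte element size are assumptions chosen for the example, not figures from the paper. The optional rounding down to a power of two is also an assumption; it is included only because power-of-two tile edges let index arithmetic use shifts and masks.

```python
import math


def largest_square_tile(cache_bytes: int, elem_bytes: int,
                        power_of_two: bool = False) -> int:
    """Return the largest tile edge T such that T*T*elem_bytes <= cache_bytes.

    Hypothetical helper for illustration; it models capacity only and
    ignores associativity, line size, and interference between arrays.
    """
    if cache_bytes < elem_bytes or elem_bytes <= 0:
        raise ValueError("cache must hold at least one element")
    max_elems = cache_bytes // elem_bytes   # elements that fit in the cache
    t = math.isqrt(max_elems)               # largest T with T*T <= max_elems
    if power_of_two:
        t = 1 << (t.bit_length() - 1)       # round down to a power of two
    return t


if __name__ == "__main__":
    l1_bytes = 16 * 1024   # assumed L1 data cache capacity
    double_bytes = 8       # assumed element size (double precision)
    print(largest_square_tile(l1_bytes, double_bytes))                     # 45
    print(largest_square_tile(l1_bytes, double_bytes, power_of_two=True))  # 32
```

The sketch captures only the capacity constraint. The paper's analysis additionally covers set associativity, interference misses, and branch misprediction costs, none of which are modeled here.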

“…As we will comment below, our results agree with this: our iterative tiled algorithm working on SB outperforms the recursive code operating on hypermatrices. Authors have also investigated on tile size selection for non-canonical array layouts [28,22,29] and have come to similar conclusions to the case of canonical storage: blocks should target the level 1 cache.…”

Section: Serial Dense Codes Using Non-canonical Array Layouts (mentioning)

confidence: 84%

Using Non-canonical Array Layouts in Dense Matrix Operations

Herrero

Navarro

Applied Parallel Computing. State of the Art in Scientific Computing

We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
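To make the square block (SB) layout concrete, the following sketch maps a matrix coordinate to its position in a one-dimensional buffer when blocks are arranged in column-major order, as the abstract describes. The function name `sb_offset`, the column-major ordering of elements inside each block, and the requirement that the matrix order be a multiple of the block size are all assumptions made for this example; the abstract does not specify them.

```python
def sb_offset(i: int, j: int, n: int, b: int) -> int:
    """Offset of element (i, j) of an n x n matrix stored in b x b blocks.

    Blocks are laid out in column-major order (as in the abstract).
    Assumptions for illustration: elements inside a block are also
    column-major, and n is a multiple of b.
    """
    if n % b != 0:
        raise ValueError("n must be a multiple of b in this sketch")
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError("element index out of range")
    blocks_per_col = n // b             # number of block rows
    bi, bj = i // b, j // b             # block coordinates
    ii, jj = i % b, j % b               # coordinates inside the block
    block_index = bj * blocks_per_col + bi   # column-major order of blocks
    return block_index * b * b + jj * b + ii


if __name__ == "__main__":
    # Print the storage offset of every element of a 4 x 4 matrix, 2 x 2 blocks.
    for row in range(4):
        print([sb_offset(row, col, 4, 2) for col in range(4)])
```

With this mapping every block occupies `b * b` contiguous positions, so a kernel working on one block touches a single contiguous region of memory.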

“…As we will comment below, our results agree with this: our iterative tiled algorithm working on BDL outperforms the recursive code operating on hypermatrices. Authors have also investigated on tile size selection for nonlinear array layouts [165,145,15] and have come to similar conclusions to the case of canonical storage: blocks should target the level 1 cache.…”

Section: Serial Dense Codes Using Nonlinear Array Layouts (mentioning)

confidence: 84%

A framework for efficient execution of matrix computations

Herrero

Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in many fields in science, engineering, surveying, physics and others. Other matrix operations occur frequently in many other fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields. We study some important operations which appear in the solution of real-world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization; dense matrix multiplication; dense Cholesky factorization; and Nearest Neighbor Classification.

A lot of research has been conducted on the efficient parallelization of numerical algorithms. However, the efficiency of a parallel algorithm depends ultimately on the performance obtained from the computations performed on each node. The work presented in this thesis focuses on the sequential execution on a single processor.

There exist a number of data structures for sparse computations which can be used to avoid the storage of and computation on zero elements. We work with a hierarchical data structure known as hypermatrix. A matrix is subdivided recursively an arbitrary number of times. Several pointer matrices are used to store the location of submatrices at each level. The last level consists of data submatrices which are dealt with as dense submatrices. When the block size of these dense submatrices is small, the number of zeros can be greatly reduced. However, the performance obtained from BLAS3 routines drops heavily. Consequently, there is a trade-off in the size of data submatrices used for a sparse Cholesky factorization with the hypermatrix scheme. Our goal is to reduce the overhead introduced by unnecessary operations on zeros when a hypermatrix data structure is used to produce a sparse Cholesky factorization. In this work we study several techniques for reducing such overhead in order to obtain high performance.

One of our goals is the creation of codes which work efficiently on different platforms when operating on dense matrices. To obtain high performance, the resources offered by the CPU must be properly utilized. At the same time, the memory hierarchy must be exploited to tolerate increasing memory latencies. To achieve the former, we produce inner kernels which use the CPU very efficiently. To achieve the latter, we investigate nonlinear data layouts. Such data formats can contribute to the effective use of the memory system.

The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. However, we want to create efficient inner kernels for a variety of processors using a general approach and avoiding hand-coding in assembly language. In this work, we present an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high-level language, which can be parameterized at compilation time. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we used on different platforms created very efficient inner kernels for matrix multiplication. Using the resulting kernels we have been able to produce high-performance sparse and dense linear algebra codes on a variety of platforms.

In this work we also show that techniques used in linear algebra codes can be useful in other fields. We present the work we have done in the optimization of the Nearest Neighbor classification, focusing on the speed of the classification process. Tuning several codes for different problems and machines can become a heavy and unbearable task. For this reason we have developed an environment for development and automatic benchmarking of codes, which is presented in this thesis. As a practical result of this work, we have been able to create efficient codes for several matrix operations on a variety of platforms. Our codes are highly competitive with other state-of-the-art codes for some problems.
