Abstract. This paper reports a study of mapping the Finite Difference Time Domain (FDTD) application to the IBM Cyclops-64 (C64) many-core chip architecture [1]. C64 is chosen for this study because it represents the current trend in computer architecture toward many-core designs with distinct features, e.g., a software-managed on-chip memory hierarchy (rather than a hardware-managed data cache), high on-chip bandwidth, and fine-grain multithreading and synchronization. The major results of our study are: (1) A good mapping of FDTD can effectively exploit the on-chip parallelism of C64-like architectures and shows good performance and scalability. (2) This performance improvement is obtained by applying code optimization techniques such as time skewing and split tiling that judiciously exploit the architectural features described above. (3) High performance requires maximum reuse of on-chip memory, which is obtained by tiling with non-conventional tile shapes. (4) The optimization techniques used in (2) and the tiling used in (3) should be implementable within a reasonable compilation framework, opening a new set of possibilities for compiler optimizations.
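For concreteness, the following is a minimal sketch (our own illustration, not the paper's code) of the time-skewing idea for a 1-D three-point stencil: a tile plus its halo is staged in fast on-chip memory, several time steps are advanced locally, and only the fully updated interior is written back. The names TILE, TSTEPS, HALO, and stencil_tile are hypothetical.

```c
/*
 * Illustrative time-skewing sketch for a 1-D three-point stencil.
 * A tile plus halo is copied into fast (on-chip) memory, TSTEPS time
 * steps are advanced locally, and only the valid interior is written
 * back.  Names and sizes are assumptions for illustration only.
 */
#include <assert.h>
#include <string.h>

#define TILE   256          /* points updated per tile            */
#define TSTEPS 4            /* time steps fused inside the tile   */
#define HALO   TSTEPS       /* halo shrinks by one per time step  */

static void stencil_tile(double *global, int n, int start)
{
    double buf[TILE + 2 * HALO], tmp[TILE + 2 * HALO];
    const int width = TILE + 2 * HALO;

    /* caller must keep the tile plus halo inside the global array */
    assert(start >= HALO && start + TILE + HALO <= n);

    memcpy(buf, &global[start - HALO], width * sizeof(double));

    for (int t = 0; t < TSTEPS; t++) {
        /* the writable region shrinks by one at each end per step */
        for (int i = t + 1; i < width - t - 1; i++)
            tmp[i] = 0.25 * buf[i - 1] + 0.5 * buf[i] + 0.25 * buf[i + 1];
        memcpy(&buf[t + 1], &tmp[t + 1],
               (size_t)(width - 2 * (t + 1)) * sizeof(double));
    }

    /* only the fully updated interior goes back to global memory */
    memcpy(&global[start], &buf[HALO], TILE * sizeof(double));
}

int main(void)
{
    static double u[4096];
    for (int i = 0; i < 4096; i++) u[i] = (double)i;

    /* update one interior tile; a real code would sweep all tiles */
    stencil_tile(u, 4096, 1024);
    return 0;
}
```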
Abstract. This paper proposes tiling techniques based on data dependencies rather than on code structure. The work presented here leverages and expands previous work by the authors in the domain of non-traditional tiling for parallel applications. The main contributions of this paper are: (1) a formal description of tiling from the point of view of the data produced rather than the source code; (2) a mathematical proof of an optimal tiling, in terms of maximum reuse, for stencil applications, addressing the disparity between computational power and memory bandwidth in many-core architectures; (3) a description and implementation of our tiling technique for well-known stencil applications; and (4) experimental evidence confirming the effectiveness of the proposed tiling in alleviating that disparity. Our experiments, performed on one of the first Cyclops-64 many-core chips produced, confirm that our approach reduces both the total number of memory operations of stencil applications and their running time.
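As a back-of-the-envelope illustration of the reuse argument (our own sketch under simplifying assumptions, not the paper's proof), consider a 1-D three-point stencil in which a tile of n points plus its halo is held in on-chip memory while k consecutive time steps are applied to it. Roughly n + 2k values are loaded and n values are stored for about nk point updates:

```latex
\[
  \frac{\text{off-chip memory operations}}{\text{point update}}
  \;\approx\; \frac{(n + 2k) + n}{nk}
  \;=\; \frac{2n + 2k}{nk}
  \;\xrightarrow{\;n \,\gg\, k\;}\; \frac{2}{k}.
\]
```

So fusing k time steps inside on-chip memory cuts memory traffic per update by roughly a factor of k relative to the untiled k = 1 case; the tile shape that achieves this reuse is dictated by the stencil's data dependencies rather than by the loop structure.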
The many-core revolution brought forward by recent advances in computer architecture has created immense challenges for writing parallel programs for High Performance Computing (HPC). Development of parallel HPC programs remains an art, and no universal doctrine for synchronization, scheduling, and execution in general has been found for many-core/multi-core architectures. These issues are exacerbated by the popularity of traditional execution models derived from the serial-programming school of thought. Previous solutions for parallel programming, such as OpenMP, MPI, and similar models, require significant effort from the programmer to achieve high performance. This paper provides an introduction to the Time Iterated Dependency Flow (TIDeFlow) model, a parallel execution model inspired by dataflow, and a description of its associated runtime system. TIDeFlow was designed for efficient development of high-performance parallel programs for many-core architectures. The TIDeFlow execution model was designed to efficiently express (1) parallel loops, (2) dependencies (data, control, or other) between parallel loops, and (3) composability of programs. TIDeFlow is a work in progress. This paper presents an introduction to the TIDeFlow execution model and shows examples and preliminary results to illustrate its qualities. The main contributions of this paper are: (1) a brief description of the TIDeFlow execution model and its programming model, (2) a description of the implementation of the TIDeFlow runtime system and its capabilities, and (3) preliminary results showing the suitability of TIDeFlow for expressing parallel programs on many-core architectures.
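The following is a conceptual sketch, not TIDeFlow code or its API: it expresses by hand, with pthreads, the pattern the abstract describes, two parallel loops with a dependency between them. A dataflow-inspired runtime such as TIDeFlow is intended to capture the same structure as a graph (an edge from one parallel loop to the next) and schedule it automatically.

```c
/*
 * Conceptual sketch (not TIDeFlow code): two parallel loops with a
 * dependency between them, written by hand with pthreads.  The barrier
 * models the dependency edge that a dataflow runtime would enforce
 * from the program graph.
 */
#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4

static double a[N], b[N];
static pthread_barrier_t done_a;   /* models the edge: loop A -> loop B */

static void *worker(void *arg)
{
    long id = (long)arg;

    /* parallel loop A: produce a[] */
    for (long i = id; i < N; i += NTHREADS)
        a[i] = (double)i;

    /* loop B may not start before every iteration of loop A finished */
    pthread_barrier_wait(&done_a);

    /* parallel loop B: consume a[], produce b[] */
    for (long i = id; i < N; i += NTHREADS)
        b[i] = 2.0 * a[i];

    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    pthread_barrier_init(&done_a, NULL, NTHREADS);
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    pthread_barrier_destroy(&done_a);

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```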
Advanced many-core CPU chips already have a few hundred processing cores (e.g., 160 cores in an IBM Cyclops-64 chip), and more cores become available as computer architecture progresses. The underlying runtime systems of such architectures need to efficiently serve hundreds of processors at the same time, requiring all basic data structures within the runtime to sustain unprecedented throughput. In this paper, we analyze the throughput requirements that must be met by algorithms in runtime systems to handle hundreds of simultaneous operations in real time. We reach a surprising conclusion: many traditional algorithmic techniques are poorly suited for highly parallel computing environments because of their low throughput. We find that the intrinsic throughput of a parallel program depends on both its algorithm and the processor architecture on which the program runs. We provide theory to quantify the intrinsic throughput of algorithms, and we give examples describing the intrinsic throughput of existing, common algorithms. We then explain how to follow a throughput-oriented approach to develop algorithms with very high intrinsic throughput on many-core architectures. We compare our throughput-oriented algorithms with other well-known algorithms that provide the same functionality, and we show that a throughput-oriented design produces algorithms with equal or better performance in highly concurrent environments. We provide both theoretical and experimental evidence showing that our algorithms are excellent choices over other state-of-the-art algorithms. The major contributions of this paper are: (1) motivating examples that show the importance of throughput in concurrent algorithms; (2) a mathematical framework that uses queueing theory to describe the intrinsic throughput of algorithms; (3) two highly concurrent algorithms with very high intrinsic throughput that are useful for task management in runtime systems; and (4) extensive experimental and theoretical results showing that, for highly parallel systems, our proposed algorithms allow greater or at least equal scalability and performance compared with other well-known state-of-the-art algorithms.
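As a hedged illustration of the throughput-oriented viewpoint (not one of the paper's two algorithms), the sketch below distributes tasks with a single atomic fetch-and-add: each claim costs one memory operation at the shared location, so the structure's throughput is limited by how fast the hardware retires fetch-and-adds rather than by lock hand-off latency.

```c
/*
 * Illustrative sketch only: claiming tasks from a shared pool with a
 * single C11 atomic fetch-and-add per claim, instead of taking a lock
 * around a shared counter.
 */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS   (1L << 20)
#define NUM_THREADS 8

static atomic_long next_task = 0;   /* shared task counter              */
static atomic_long work_done = 0;   /* only used to observe progress    */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* one fetch-and-add claims the next task index */
        long t = atomic_fetch_add(&next_task, 1);
        if (t >= NUM_TASKS)
            break;
        atomic_fetch_add(&work_done, 1);   /* stand-in for the task body */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("tasks executed: %ld\n", atomic_load(&work_done));
    return 0;
}
```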