Optimizing Chip Multiprocessor Work Distribution Using Dynamic Compilation

Zhao, Jisheng; Horsnell, Matthew; Rogers, Ian; Dinn, Andrew; Kirkham, Chris; Watson, Ian

doi:10.1007/978-3-540-74466-5_29

Cited by 5 publications

(8 citation statements)

References 11 publications

“…Zhao et al [40,39] have also implemented loop parallelization in the context of JikesRVM. However, rather than GPUs, their intended target is JAMAICA [2], a multi-processor parallel architecture.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic parallelization for graphics processing units

Leung

Lhoták

Lashari

2009

Proceedings of the 7th International Conference on Principles and Practice of Programming in Java

Accelerated graphics cards, or Graphics Processing Units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in terms of raw performance. However, because they are difficult to program, GPUs are used only for a narrow class of special-purpose applications; the raw processing power made available by GPUs is unused most of the time. This paper presents an extension to a Java JIT compiler that executes suitable code on the GPU instead of the CPU. Both static and dynamic features are used to decide whether it is feasible and beneficial to off-load a piece of code on the GPU. The paper presents a cost model that balances the speedup available from the GPU against the cost of transferring input and output data between main memory and GPU memory. The cost model is parameterized so that it can be applied to different hardware combinations. The paper also presents ways to overcome several obstacles to parallelization inherent in the design of the Java bytecode language: unstructured control flow, the lack of multi-dimensional arrays, the precise exception semantics, and the proliferation of indirect references.
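The abstract does not give the cost model itself, so the following is only a minimal sketch of the trade-off it describes: off-loading pays off when the estimated GPU compute time plus the data transfer time is below the CPU time. The class name, method names, and all parameters (`speedup`, `bytesPerSecond`, and so on) are hypothetical and are not taken from the paper.

```java
// Illustrative sketch only; not the cost model from the cited paper.
// All names and parameters here are assumptions made for this example.
class OffloadCostSketch {

    // Estimated GPU time: compute time scaled by an assumed speedup factor,
    // plus the time to copy inputs to the GPU and outputs back.
    static double gpuSeconds(double cpuSeconds, double speedup,
                             long bytesIn, long bytesOut, double bytesPerSecond) {
        double computeSeconds = cpuSeconds / speedup;
        double transferSeconds = (bytesIn + bytesOut) / bytesPerSecond;
        return computeSeconds + transferSeconds;
    }

    // Off-load only when the GPU estimate, including transfers, beats the CPU.
    static boolean shouldOffload(double cpuSeconds, double speedup,
                                 long bytesIn, long bytesOut, double bytesPerSecond) {
        return gpuSeconds(cpuSeconds, speedup, bytesIn, bytesOut, bytesPerSecond) < cpuSeconds;
    }
}
```

Keeping `speedup` and `bytesPerSecond` as parameters mirrors the abstract's point that the model is parameterized for different hardware combinations; with a slower bus, the same loop can flip from worth off-loading to not worth it.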

“…The Online Tuning Framework (OTF) infrastructure, initially developed for CMP loop optimizations [19], performs automatic parallelization and enables runtime empirical search. It consists of three distinct elements: the Loop Parallelizing Compiler (LPC), the adaptive optimization component (see Section 3.1), and the runtime profiler (see Section 3.2).…”

Section: Online Tuning Framework (mentioning)

confidence: 99%

“…In the current implementation, 2-dimensional loop traversals of the iteration space are divided into tiles which are then distributed among automatically generated parallel threads. We extend the basic empirical search algorithm [19] to vary the number of loop iterations inside each tile for the clusters and levels of the memory hierarchy. These parameters directly impact the balance between costs associated with thread management, the cache efficiency, and system load.…”

Section: Adaptive Optimization Component (mentioning)

confidence: 99%
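As a rough illustration of the statement above, the sketch below divides a 2-dimensional iteration space into square tiles, processes the tiles in parallel, and then times a few candidate tile sizes to pick the fastest. It is not the OTF implementation: the class and method names are hypothetical, a Java parallel stream stands in for the automatically generated threads, and a single timed run per candidate stands in for the empirical search algorithm of [19].

```java
import java.util.stream.IntStream;

// Illustrative sketch only; names are hypothetical, not from the cited system.
class TiledLoopSketch {

    // Sums a rectangular 2-D array by splitting it into tile x tile blocks.
    // Each block is one unit of parallel work. Requires tile > 0.
    static long tiledSum(int[][] a, int tile) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        int tilesPerRow = (cols + tile - 1) / tile; // ceiling division
        int tilesPerCol = (rows + tile - 1) / tile;
        return IntStream.range(0, tilesPerRow * tilesPerCol)
                .parallel()
                .mapToLong(t -> {
                    // Map the flat tile index back to the tile's top-left corner.
                    int i0 = (t / tilesPerRow) * tile;
                    int j0 = (t % tilesPerRow) * tile;
                    int iEnd = Math.min(i0 + tile, rows);
                    int jEnd = Math.min(j0 + tile, cols);
                    long s = 0;
                    for (int i = i0; i < iEnd; i++) {
                        for (int j = j0; j < jEnd; j++) {
                            s += a[i][j];
                        }
                    }
                    return s;
                })
                .sum();
    }

    // Naive empirical search: run each candidate once and keep the fastest.
    // A real tuner would repeat measurements and account for JIT warm-up.
    static int searchTileSize(int[][] a, int[] candidates) {
        int best = candidates[0];
        long bestNanos = Long.MAX_VALUE;
        for (int tile : candidates) {
            long start = System.nanoTime();
            tiledSum(a, tile);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestNanos) {
                bestNanos = elapsed;
                best = tile;
            }
        }
        return best;
    }
}
```

Small tiles create many tasks and raise scheduling overhead, while large tiles reduce parallelism and may overflow a cache level; that is the balance the quoted passage attributes to the tile-size parameters.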

“…Based on our previous research [19], at runtime it is feasible to automatically parallelize loops and also empirically search for adequate loop tiling sizes in CMP architectures with acceptable overheads. In this paper we concentrate on multicluster CMPs and whether adequate loop tiling sizes can be found at runtime for the automatically parallelized loops.…”

Section: Introduction (mentioning)

confidence: 99%

“…The L2 cache is unified containing both data and instructions, further complicating predictions as to how much space is available to data alone. For a multi-cluster CMP system which connects all the clusters by the L2 cache bus, the data locality in each L2 cache determines significantly the runtime performance. This paper investigates optimizations that search for multiple tile sizes to best utilize two levels of on-chip caching in a multi-cluster CMP, using runtime information to drive the search algorithm, in conjunction with an Online Tuning Framework (OTF) [19]. To exploit the cache hierarchy and the cluster structure two tile sizes need to be determined.…”

Section: Introduction (mentioning)

confidence: 99%

Adaptive Loop Tiling for a Multi-cluster CMP

Zhao

Horsnell

Luján

et al.

Algorithms and Architectures for Parallel Processing

Self Cite

Abstract. Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size, combined with the parallelization of loops, can provide additional performance increases in modern Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system which automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching are capable of improving performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite overheads.