Matthew Horsnell scite author profile

Rogers

et al. 2007

How can sequential applications benefit from the ubiquitous next generation of chip multiprocessors (CMP)? Part of the answer may be a dynamic execution environment that automatically parallelizes programs and adaptively tunes the work distribution. Experiments using the Jamaica CMP show how a runtime environment is capable of parallelizing standard benchmarks and achieving performance improvements over traditional work distributions.

Evaluation of Hybrid Run-Time Power Models for the ARM Big.LITTLE Architecture

Nikov

Nunez-Yanez

2015

Heterogeneous processors, formed by binary compatible CPU cores with different microarchitectures, enable energy reductions by better matching processing capabilities and software application requirements. This new hardware platform requires novel techniques to manage power and energy to fully utilize its capabilities, particularly regarding the mapping of workloads to appropriate cores. In this paper we validate relevant published work related to power modelling for heterogeneous systems and propose a new approach for developing run-time power models that uses a hybrid set of physical predictors, performance events and CPU state information. We demonstrate the accuracy of this approach compared with the state-of-the-art and its applicability to energy aware scheduling. Our results are obtained on a commercially available platform built around the Samsung Exynos 5 Octa SoC, which features the ARM big.LITTLE heterogeneous architecture.

An adaptive bloom filter cache partitioning scheme for multicore architectures

Nikas

Garside

2008

This paper investigates the problem of partitioning the last-level shared cache of multicore architectures. Contention for such a shared resource has been shown to severely degrade performance when running multiple applications. As architectures incorporate more cores, multiple application workloads become increasingly attractive, further exacerbating contention at the last-level cache. Today, cache replacement policies, extensively studied for uniprocessor systems, are being employed within new multicore architectures with little, if any, adaptation. However, the parameters in these new systems are likely to be different. The least recently used (LRU) policy, for example, which is widely accepted as the best replacement policy in uniprocessor caches, often results in poor resource sharing in a multicore system, signalling the importance of reevaluating the effectiveness of these policies in the new architectures. This paper proposes Adaptive Bloom Filter Cache Partitioning (ABFCP), a low-cost, dynamic cache partitioning mechanism capable of better resource sharing at the last-level cache than LRU, improving the performance of an eight-core system on average by 5.92% over the LRU policy. Moreover, the proposed scheme provides the equivalent performance benefits that could be gained from almost a 50% increase in the last-level cache and shows increasing benefit as the number of cores rises.

An Object-Aware Hardware Transactional Memory System

Khan

Rogers

et al. 2008

Transactional Memory (TM) is receiving attention as a way of expressing parallelism for programming multi-core systems. As a parallel programming model it is able to avoid the complexity of conventional locking. TM can enable multi-core hardware that dispenses with conventional bus-based cache coherence, resulting in simpler and more extensible systems. This is increasingly important as we move into the many-core era. Within TM, however, the processes of conflict detection and committing still require synchronization and the broadcast of data. By increasing the granularity of when synchronization is required, the demands on communication are reduced. Software implementations of TM have taken advantage of the fact that the object structure of data can be employed to further raise the level at which interference is observed. The contribution of this paper is the first hardware TM approach where the object structure is recognized and harnessed. This leads to novel commit and conflict detection mechanisms, and also to an elegant solution to the virtualization of version management, without the need for additional software TM support. A first implementation of the proposed hardware TM system is simulated. The initial evaluation is conducted with three benchmarks derived from the STAMP suite and a transactional version of Lee's routing algorithm.

Adaptive Loop Tiling for a Multi-cluster CMP

Zhao

Luján

et al.

Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size combined with the parallelization of loops can provide additional performance increases on modern Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system which automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching are capable of improving performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite overheads.
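
For readers unfamiliar with loop tiling and empirical tile-size search, the following is a minimal, sequential Python sketch of the general idea. It is not the system described in the abstract, which parallelizes Java loops inside a virtual machine. The function names (`matmul_naive`, `matmul_tiled`, `best_tile`) and the candidate tile sizes are assumptions made purely for illustration.

```python
import time


def matmul_naive(a, b):
    """Reference matrix product using the plain i-j-k loop nest."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile):
    """Same product, with all three loops blocked into tile x tile chunks
    so that each chunk of a, b and c is reused while it is still in cache."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # min() handles edge tiles when tile does not divide the size.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


def best_tile(a, b, candidates):
    """Hypothetical empirical search: time each candidate tile size once
    and return the fastest. A real system would repeat and smooth timings."""
    timings = {}
    for tile in candidates:
        start = time.perf_counter()
        matmul_tiled(a, b, tile)
        timings[tile] = time.perf_counter() - start
    return min(timings, key=timings.get)


if __name__ == "__main__":
    size = 48
    a = [[(i + j) % 7 for j in range(size)] for i in range(size)]
    b = [[(i * j) % 5 for j in range(size)] for i in range(size)]
    print("fastest tile size in this run:", best_tile(a, b, [4, 8, 16, 32]))
```

In interpreted Python the cache effect of tiling is largely hidden by interpreter overhead, so the chosen tile size here is mostly noise. The sketch shows only the structure of the transformation and of the search loop, not a realistic performance result.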