Runnemede: An architecture for Ubiquitous High-Performance Computing

Carter, N. P.; Agrawal, Abhishek; Borkar, S.; Cledat, Romain; David, Howard; Dunning, Dave; Fryman, Josh; Ganev, Ivan; Golliver, R. A.; Knauerhase, Rob; Lethin, Richard; Meister, Benoit; Mishra, Asit K.; Pinfold, Wilfred; Teller, Justin; Torrellas, Josep; Vasilache, Nicolas; Venkatesh, Ganesh; Xu, Jianzhong

doi:10.1109/hpca.2013.6522319

Cited by 77 publications

(53 citation statements)

References 26 publications


“…Future and emerging many-core processors, such as Intel's Runnemede [5], will provide communication pathways through distributed address spaces or shared address spaces, both on-chip and off-chip. The idea elaborated in this work is to use distributed address spaces in runtime system stages where cores share no application data and need to exchange only control messages for the purposes of scheduling and load balancing.…”

Section: Discussion (mentioning)

confidence: 99%

“…However, processors designed for more specialized markets, such as high performance computing and large-scale data processing, use memory hierarchies without a coherence protocol. Graphics Processing Units (GPUs) [2], the Intel SCC [3], the Cell processor [4] and the experimental Runnemede prototype [5] are representative examples of non-cache-coherent architectures. Programming a non-coherent architecture requires explicit communication between local address spaces, through message passing or Direct Memory Access (DMA).…”

Section: Introduction (mentioning)

confidence: 99%


Hybrid address spaces: A methodology for implementing scalable high-level programming models on non-coherent many-core architectures

Papagiannis, Nikolopoulos (2014). Journal of Systems and Software


This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained-to-peak data processing bandwidth, where HyMR demonstrates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior to that of message passing.


“…This results in energy-inefficient designs. On-chip networks can already consume a substantial fraction of the on-chip power, potentially up to 30-40%, according to the literature [4,6,8,13,17,29]. Conservative future network designs, needed to tolerate parameter variations, may be unable to reduce the value of this fraction much.…”

Section: Introduction (mentioning)

confidence: 99%

Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks

Ansari, Mishra, et al. (2014). 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)


“…The Rigel architecture [21] proposes having clusters of cores with L1 instruction caches and incoherent L2 caches (per cluster), together with a global shared L3 cache. Finally, the Runnemede architecture [8] also relies on a dataflow execution model to execute in a near-threshold computing environment, with multiple clusters of homogeneous cores and a hierarchy of local memories. In this architecture, coherence between clusters is fully managed in software.…”

Section: Runtime-aware Architectures (mentioning)

confidence: 99%

Runtime-Aware Architectures: A First Approach

Valero, Moretó, Casas, et al. (2014). JSFI


In the last few years, the traditional ways to keep the increase of hardware performance at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. With the irruption of multi-cores and parallel applications, this simple interface started to leak. As a consequence, the role of decoupling again applications from the hardware was moved to the runtime system. Efficiently using the underlying hardware from this runtime without exposing its complexities to the application has been the target of very active and prolific research in the last years.

Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores already have to face. It is our position that the runtime has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In this paper, we introduce a first approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.

Keywords: Parallel architectures, runtime system, hardware-software co-design.

Introduction

When uniprocessors were the norm, Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP) were widely exploited to increase the number of instructions executed per cycle. The main hardware designs that were used to exploit ILP were superscalar and Very Long Instruction Word (VLIW) processors. The VLIW approach requires to statically determine dependencies between instructions and schedule them. However, since it is not possible in general to obtain optimal schedulings at compile time, VLIW does not fully exploit the potential ILP that many workloads have. Superscalar designs try to overcome the increasing memory latencies, the so-called Memory Wall [42], by using Out of Order (OoO) and speculative executions [18]. Additionally, techniques such as prefetching, to start fetching data from the memory ahead of time, deep memory hierarchies, to exploit the locality that many programs have, and large reorder buffers, to increase the number of speculative instructions exposed to the hardware, have been also used to enhance superscalar processors performance. DLP is typically expressed explicitly at the software layer and it consisted in a parallel operation on multiple data performed by multiple independent instructions, or by multiple independent threads. In uniprocessors, the Instruction Set Architecture (ISA) was in charge of decoupling the application, written in a high-level programming language, and the hardware, as we can see in the left hand side of Figure 1. In this ...
