2011 IEEE International Parallel & Distributed Processing Symposium
DOI: 10.1109/ipdps.2011.420

Power, Programmability, and Granularity: The Challenges of ExaScale Computing

Abstract: Reaching an ExaScale computer by the end of the decade, and enabling the continued performance scaling of smaller systems requires significant research breakthroughs in three key areas: power efficiency, programmability, and execution granularity. To build an ExaScale machine in a power budget of 20MW requires a 200-fold improvement in energy per instruction: from 2nJ to 10pJ. Only 4x is expected from improved technology. The remaining 50x must come from improvements in architecture and circuits. To program a …
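As a quick sanity check on the abstract's arithmetic, the short Python sketch below (the variable names and layout are ours; the figures come from the abstract) reproduces the budget: an exaflop machine within 20 MW allows roughly 20 pJ per operation, and moving from 2 nJ to 10 pJ per instruction is the 200-fold improvement that factors into 4x from technology and 50x from architecture and circuits.

ops_per_second = 1e18                    # one exa-op per second
power_budget_w = 20e6                    # 20 MW power envelope
print(power_budget_w / ops_per_second)   # 2e-11 J, i.e. ~20 pJ available per operation
print(2e-9 / 10e-12)                     # 200.0 -> the 200-fold energy-per-instruction improvement
print(4 * 50)                            # 200   -> 4x from technology times 50x from architecture/circuits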

Cited by 46 publications (27 citation statements), published between 2012 and 2019. References: 0 publications.

Citation statements (ordered by relevance):
“…These models enable the analysis of data transfer between two levels of the memory hierarchy. Lower data transfer complexity implies better data locality and, therefore, higher energy efficiency since energy consumption caused by data transfer dominates the total energy consumption [18].…”
Section: Bounded Ideal Cache Model
confidence: 99%
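As a rough illustration of the quoted point (our own sketch, not taken from the citing paper), data-movement energy can be modeled as the number of words transferred between two memory levels times the energy per transferred word; the classical gap in transfer complexity between an untiled and a cache-blocked matrix multiply then becomes an energy gap directly. The matrix size, cache capacity, and 300 pJ-per-word figure below are assumptions for illustration only.

def movement_energy_joules(words_moved, pj_per_word):
    # total data-movement energy = number of transfers x energy per transfer
    return words_moved * pj_per_word * 1e-12

n = 4096                    # matrix dimension (assumed)
cache_words = 1 << 20       # fast-memory capacity in words (assumed)
pj_per_word = 300           # energy per off-chip word transfer (assumed)

naive_moves = n ** 3                           # untiled matmul: ~n^3 words cross the boundary
blocked_moves = n ** 3 / cache_words ** 0.5    # tiled matmul: O(n^3 / sqrt(M)) words

print(movement_energy_joules(naive_moves, pj_per_word))    # ~20.6 J of pure data movement
print(movement_energy_joules(blocked_moves, pj_per_word))  # ~0.02 J, roughly 1000x less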
“…Unlike conventional locality-aware data structures and algorithms, which only consider whether the data is on-chip (e.g., in cache) or not (e.g., in DRAM), new energy-efficient data structures and algorithms must consider data locality at a finer granularity: where on the chip the data is. It is estimated that for chips using the 10nm technology, the energy gap between accessing data in nearby on-chip memory (e.g., data in SRAM) and accessing data across the chip (e.g., on-chip data at a distance of 10mm) will be as much as 75x (2pJ versus 150pJ), whereas the energy gap between accessing on-chip data and accessing off-chip data (e.g., data in DRAM) will be only 2x (150pJ versus 300pJ) [18]. Therefore, in order to construct energy-efficient software systems, data structures and algorithms should support not only high parallelism but also …”
Section: Introduction
confidence: 99%
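Plugging the per-access energies quoted above (2 pJ for nearby SRAM, 150 pJ for an access across the chip, 300 pJ for off-chip DRAM) into a few lines of Python reproduces the 75x and 2x gaps; the access count is arbitrary and only there to express the totals in joules.

E_NEAR_PJ, E_FAR_PJ, E_DRAM_PJ = 2, 150, 300    # figures quoted from [18] above

print(E_FAR_PJ / E_NEAR_PJ)    # 75.0 -> where on the chip the data sits matters 75x
print(E_DRAM_PJ / E_FAR_PJ)    # 2.0  -> leaving the chip "only" doubles it again

accesses = 1_000_000           # arbitrary access count, for scale
for label, pj in (("near SRAM", E_NEAR_PJ), ("across chip", E_FAR_PJ), ("DRAM", E_DRAM_PJ)):
    print(label, accesses * pj * 1e-12, "J")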
“…They have been adopted in many supercomputers, e.g., Titan, Stampede, and Tianhe-2, mainly for two purposes: (1) improving the performance, and (2) reducing the overall power consumption [1]. As GPUs are becoming ubiquitous in HPC, numerous applications have been ported to GPU-based systems over the past several years, including large-scale scientific applications on GPU clusters [2]-[4].…”
Section: Introduction
confidence: 99%
“…With each level of the memory hierarchy that a data transfer crosses (e.g. between on-chip caches, or from last-level cache to DRAM), the energy consumption of the transfer increases by one order of magnitude or more [4]. The memory hierarchy remains the most important performance factor in computing systems, as latency keeps lagging bandwidth [5].…”
Section: Introduction
confidence: 99%
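A minimal sketch of the order-of-magnitude-per-level rule of thumb in the quote above, assuming a 1 pJ baseline for a first-level hit and a factor of ten for every additional level a transfer has to cross (both numbers are illustrative, not measured values):

BASE_PJ, FACTOR = 1.0, 10.0
for depth, level in enumerate(["L1", "L2", "last-level cache", "DRAM"]):
    print(f"{level}: ~{BASE_PJ * FACTOR ** depth:.0f} pJ per access")
# Under this assumption a transfer that goes all the way to DRAM costs ~1000x an L1 hit
# in energy, which is why the memory hierarchy dominates both power and performance.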