2010 IEEE International Conference on Computer Design
DOI: 10.1109/iccd.2010.5647747
Threads vs. caches: Modeling the behavior of parallel workloads

Abstract: A new generation of high-performance engines now combines graphics-oriented parallel processors with a cache architecture. To meet this trend, new highly-parallel workloads are being developed. However, it is often difficult to predict how a given application will perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model which, for a given application, …

Cited by 16 publications (7 citation statements)
References 16 publications
“…Kgil et al [29] show that, for a particular class of throughput-oriented web workloads, modern processors are extremely power inefficient, arguing that the chip area should be used for processing cores rather than caches. A similar observation has been made in the GPU domain [18]. Our results corroborate these findings, showing that, for scale-out workloads, the time spent accessing the large and slow last-level caches accounts for more than half of the data stalls [22], calling for resizing and reorganizing the cache hierarchy.…”
Section: Related Work (supporting)
confidence: 87%
“…Consequently, PCAL does not transition to the global optimum at N = 15. Therefore, when there are multiple performance peaks in the {N, p} solution space, as is the case in GPUs [16], [17], PCAL becomes prone to a local optimum point that is nearest to the starting point. Furthermore, even by avoiding local optima through advanced search techniques such as stochastic search (as discussed in Section VII-J), when the starting point is far from the global optimum (as is the case in the above example), it would require multiple iterations to converge on a solution.…”
Section: Pitfalls in Prior Techniques (mentioning)
confidence: 99%
“…This is because of the following two reasons. Firstly, traditional heuristic-based search techniques are prone to local optima in the presence of multiple performance peaks, as is the case in GPUs [16], [17], thereby leading to sub-optimal solutions. Secondly, iterative search techniques are slow and expensive, particularly in hardware, due to the time spent in sampling to generate new iteration points.…”
Section: Introduction (mentioning)
confidence: 99%
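The pitfall these statements describe — a greedy, neighborhood-based search settling on the performance peak nearest its starting point — can be illustrated with a minimal sketch. The curve values and the peak locations (a local peak at N = 4, the global one at N = 15, echoing the example above) are illustrative assumptions, not measurements from the cited work:

```python
def hill_climb(perf, start, lo, hi):
    """Greedy neighborhood search: repeatedly move to the better of the two
    adjacent points.  Stops at the first point neither neighbor improves on,
    i.e. a local optimum, which need not be the global one."""
    n = start
    while True:
        neighbors = [m for m in (n - 1, n + 1) if lo <= m <= hi]
        best = max(neighbors, key=perf)
        if perf(best) <= perf(n):
            return n
        n = best

# A toy performance curve over thread counts N = 0..15 with two peaks:
# a local peak at N = 4 and the global optimum at N = 15 (values invented).
curve = [0, 2, 5, 7, 8, 6, 4, 3, 2, 3, 5, 8, 11, 14, 16, 17]
perf = lambda n: curve[n]

print(hill_climb(perf, start=2, lo=0, hi=15))   # stuck at the local peak, N = 4
print(hill_climb(perf, start=12, lo=0, hi=15))  # reaches the global peak, N = 15
```

Starting near the local peak traps the search at N = 4; only a start already close to the global optimum reaches N = 15, which is exactly why multi-peak GPU solution spaces defeat simple greedy tuning.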
“…Guz et al [1] [14] describe a "performance valley" that exists between systems in which the working set of the active threads fits primarily within the cache and systems with massive multithreading where fine-grained context-switching can hide the memory latency. Several variants of thread throttling mechanisms have been proposed to climb out of this "valley", thereby moving the threads to the larger per-thread capacity domain.…”
Section: Related Work (mentioning)
confidence: 99%
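The shape of that "performance valley" can be sketched with a toy analytical model (a simplification under assumed parameters — cache size, working set, memory latency — not the actual equations of Guz et al.): with few threads the working sets fit in cache and hit rates are high; at intermediate thread counts the cache thrashes but there are too few threads to hide the resulting misses; with massive multithreading the misses are overlapped and throughput recovers.

```python
def throughput(n, cache_kb=1024, ws_kb=64, mem_lat=200, cpi_exe=1.0):
    """Illustrative throughput (IPC) model for n threads sharing one cache.
    All parameters are invented for illustration.  Each thread gets
    cache_kb/n of cache for a ws_kb working set; a miss costs mem_lat
    cycles, but n in-flight threads can interleave to hide stalls."""
    hit = min(1.0, (cache_kb / n) / ws_kb)       # per-thread cache hit rate
    stall = (1.0 - hit) * mem_lat                # avg stall cycles per instruction
    # Pipeline utilization: with enough ready threads to cover each
    # thread's stall time, the core stays busy; otherwise it idles.
    util = min(1.0, n * cpi_exe / (cpi_exe + stall))
    return util / cpi_exe

# Sweep the thread count: throughput is high while working sets fit
# (n <= 16 here), dips into the "valley", then recovers once there are
# enough threads to hide the memory latency.
for n in (8, 16, 32, 64, 128, 256):
    print(n, round(throughput(n), 3))
```

With these parameters the valley floor sits around n = 32, matching the intuition in the citation above: thread throttling keeps the system on the cache-friendly side of the valley, while massive multithreading climbs out on the latency-hiding side.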