2010 IEEE International Conference on Computer Design
DOI: 10.1109/iccd.2010.5647747
Threads vs. caches: Modeling the behavior of parallel workloads

Abstract: A new generation of high-performance engines now combines graphics-oriented parallel processors with a cache architecture. To meet this trend, new highly-parallel workloads are being developed. However, it is often difficult to predict how a given application will perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model which, for a given application, …

Cited by 16 publications (7 citation statements)
References 16 publications
“…Kgil et al [29] show that, for a particular class of throughput-oriented web workloads, modern processors are extremely power inefficient, arguing that the chip area should be used for processing cores rather than caches. A similar observation has been made in the GPU domain [18]. Our results corroborate these findings, showing that, for scale-out workloads, the time spent accessing the large and slow last-level caches accounts for more than half of the data stalls [22], calling for resizing and reorganizing the cache hierarchy.…”
Section: Related Work (supporting)
confidence: 87%
“…Consequently, PCAL does not transition to the global optimum at N = 15. Therefore, when there are multiple performance peaks in the {N, p} solution space, as is the case in GPUs [16], [17], PCAL becomes prone to a local optimum point that is nearest to the starting point. Furthermore, even by avoiding local optima through advanced search techniques such as stochastic search (as discussed in Section VII-J), when the starting point is far from the global optimum (as is the case in the above example), it would require multiple iterations to converge on a solution.…”
Section: Pitfalls in Prior Techniques (mentioning)
confidence: 99%
“…This is because of the following two reasons. Firstly, traditional heuristic-based search techniques are prone to local optima in the presence of multiple performance peaks, as is the case in GPUs [16], [17], thereby leading to sub-optimal solutions. Secondly, iterative search techniques are slow and expensive, particularly in hardware, due to the time spent in sampling to generate new iteration points.…”
Section: Introduction (mentioning)
confidence: 99%
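The pitfall these statements describe — a greedy, neighborhood-based search settling on the performance peak nearest its starting point — can be illustrated with a minimal sketch. The curve values and the peak locations (a local peak at N = 4, the global one at N = 15, echoing the example above) are illustrative assumptions, not measurements from the cited work:

```python
def hill_climb(perf, start, lo, hi):
    """Greedy neighborhood search: repeatedly move to the better of the two
    adjacent points.  Stops at the first point neither neighbor improves on,
    i.e. a local optimum, which need not be the global one."""
    n = start
    while True:
        neighbors = [m for m in (n - 1, n + 1) if lo <= m <= hi]
        best = max(neighbors, key=perf)
        if perf(best) <= perf(n):
            return n
        n = best

# A toy performance curve over thread counts N = 0..15 with two peaks:
# a local peak at N = 4 and the global optimum at N = 15 (values invented).
curve = [0, 2, 5, 7, 8, 6, 4, 3, 2, 3, 5, 8, 11, 14, 16, 17]
perf = lambda n: curve[n]

print(hill_climb(perf, start=2, lo=0, hi=15))   # stuck at the local peak, N = 4
print(hill_climb(perf, start=12, lo=0, hi=15))  # reaches the global peak, N = 15
```

Starting near the local peak traps the search at N = 4; only a start already close to the global optimum reaches N = 15, which is exactly why multi-peak GPU solution spaces defeat simple greedy tuning.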
“…Guz et al [1] [14] describe a "performance valley" that exists between systems in which the working set of the active threads fits primarily within the cache and systems with massive multithreading where fine-grained context-switching can hide the memory latency. Several variants of thread throttling mechanisms have been proposed to climb out of this "valley", thereby moving the threads to the larger per-thread capacity domain.…”
Section: Related Work (mentioning)
confidence: 99%
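The shape of that "performance valley" can be sketched with a toy analytical model (a simplification under assumed parameters — cache size, working set, memory latency — not the actual equations of Guz et al.): with few threads the working sets fit in cache and hit rates are high; at intermediate thread counts the cache thrashes but there are too few threads to hide the resulting misses; with massive multithreading the misses are overlapped and throughput recovers.

```python
def throughput(n, cache_kb=1024, ws_kb=64, mem_lat=200, cpi_exe=1.0):
    """Illustrative throughput (IPC) model for n threads sharing one cache.
    All parameters are invented for illustration.  Each thread gets
    cache_kb/n of cache for a ws_kb working set; a miss costs mem_lat
    cycles, but n in-flight threads can interleave to hide stalls."""
    hit = min(1.0, (cache_kb / n) / ws_kb)       # per-thread cache hit rate
    stall = (1.0 - hit) * mem_lat                # avg stall cycles per instruction
    # Pipeline utilization: with enough ready threads to cover each
    # thread's stall time, the core stays busy; otherwise it idles.
    util = min(1.0, n * cpi_exe / (cpi_exe + stall))
    return util / cpi_exe

# Sweep the thread count: throughput is high while working sets fit
# (n <= 16 here), dips into the "valley", then recovers once there are
# enough threads to hide the memory latency.
for n in (8, 16, 32, 64, 128, 256):
    print(n, round(throughput(n), 3))
```

With these parameters the valley floor sits around n = 32, matching the intuition in the citation above: thread throttling keeps the system on the cache-friendly side of the valley, while massive multithreading climbs out on the latency-hiding side.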