2010
DOI: 10.1145/1837853.1693482

Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Abstract: Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simu…

Citation Types: 0 supporting, 44 mentioning, 1 contrasting

Cited by 35 publications (45 citation statements)
References 35 publications

“…However, if the total required bandwidth is greater than the system's bandwidth, we can conclude that with this number of threads the application is bandwidth-limited, and will not achieve the expected throughput. [Footnote 1: The motivation of this work is the study of the performance and scalability of data parallel programs whose threads execute the same code but work on different parts of the application's data. For such applications, the threads' equal demand for resources results in an equal distribution of cache capacity across the threads [7,10].]…”
Section: Motivation (mentioning)
confidence: 98%
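A minimal sketch of the bandwidth-limit reasoning in the excerpt above. The per-thread demand and system bandwidth figures are invented for illustration (the cited work measures them); the point is only that once aggregate demand crosses the system's bandwidth, additional threads stop buying throughput.

```c
#include <stdio.h>

/* Hypothetical figures, for illustration only. */
#define PER_THREAD_BW_GBS 3.2   /* off-chip demand of one thread */
#define SYSTEM_BW_GBS     25.6  /* total memory bandwidth        */

int main(void) {
    for (int threads = 1; threads <= 12; threads++) {
        double demand = threads * PER_THREAD_BW_GBS;
        /* Aggregate demand above the system's bandwidth means the
         * application is bandwidth-limited at this thread count. */
        printf("%2d threads: demand %5.1f GB/s -> %s\n",
               threads, demand,
               demand > SYSTEM_BW_GBS ? "bandwidth-limited" : "scales");
    }
    return 0;
}
```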
“…ity of multithreaded data parallel programs. For such programs, the similar execution of each thread leads to an equal distribution of memory resources [7,10]. This suggests that if we knew how a single thread's performance and off-chip bandwidth demand change as a function of its shared cache space allocation, we could determine the performance and bandwidth demand as a function of the number of threads, and therefore predict how the application will scale.…”
Section: Introduction (mentioning)
confidence: 99%
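The prediction idea in this excerpt can be sketched directly: profile one thread's performance and bandwidth demand as functions of its cache share, split the shared cache equally across n threads (as the equal-demand observation justifies), and cap throughput by system bandwidth. The profile functions and constants below are made up for illustration; the cited work's actual model may combine the curves differently.

```c
#include <stdio.h>

#define TOTAL_CACHE_MB 8.0
#define SYSTEM_BW_GBS  25.6

/* Hypothetical single-thread profiles (invented): throughput and
 * off-chip bandwidth demand as functions of the thread's share of
 * the shared last-level cache. */
static double perf_at(double mb)      { return 1.0 + 0.5 * mb; }
static double bw_demand_at(double mb) { return 8.0 / (1.0 + mb); }

int main(void) {
    for (int n = 1; n <= 8; n++) {
        double share  = TOTAL_CACHE_MB / n;      /* equal split   */
        double ideal  = n * perf_at(share);      /* compute-bound */
        double demand = n * bw_demand_at(share);
        /* Once aggregate demand exceeds system bandwidth, only the
         * servable fraction of that demand turns into throughput. */
        double predicted = demand > SYSTEM_BW_GBS
                             ? ideal * (SYSTEM_BW_GBS / demand)
                             : ideal;
        printf("%d threads: %.2f MB/thread -> throughput %.2f\n",
               n, share, predicted);
    }
    return 0;
}
```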
“…Although reduced input sets are available for simulation studies, we use the largest input sets designed for native execution. Previous characterizations of PARSEC have found it to be sensitive to cache capacity [4], but also resilient to performance degradation in the face of intra-application cache sharing [39].…”
Section: Description Of Workloads (mentioning)
confidence: 99%
“…This work does not aim to produce a universally applicable, fastest LU decomposition, but uses LU decomposition as an example problem to reveal the interactions among different optimizations on Cell B/E and to obtain insights into holistic optimization for heterogeneous multicore architectures. Locality optimization has been a focus of many previous studies, especially on traditional CPUs and modern multicores [13,10]. For LU decomposition, an example is automatic blocking to improve its locality on SMPs [12].…”
Section: Related Work (mentioning)
confidence: 99%
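To make the blocking-for-locality reference concrete, here is a toy in-place LU factorization (no pivoting, diagonally dominant input) whose trailing-submatrix update is tiled so each tile is reused while it is still cache-resident. This is a minimal sketch of the transformation, not the cited papers' implementations; production blocked LU (e.g., LAPACK's) instead factors panels and applies BLAS-3 updates.

```c
#include <stdio.h>

#define N 8   /* matrix dimension (toy size)             */
#define B 4   /* tile size; pick so a tile fits in cache */

/* In-place LU without pivoting; requires a nonzero diagonal.
 * The O(n^3) trailing update is tiled over i and j for locality. */
static void lu_blocked(double A[N][N]) {
    for (int k = 0; k < N; k++) {
        for (int i = k + 1; i < N; i++)
            A[i][k] /= A[k][k];                  /* column of L  */
        for (int ii = k + 1; ii < N; ii += B)    /* tiled update */
            for (int jj = k + 1; jj < N; jj += B)
                for (int i = ii; i < ii + B && i < N; i++)
                    for (int j = jj; j < jj + B && j < N; j++)
                        A[i][j] -= A[i][k] * A[k][j];
    }
}

int main(void) {
    double A[N][N];
    /* Diagonally dominant matrix, so no pivoting is needed. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N + 1.0 : 1.0;
    lu_blocked(A);
    printf("U[0][0]=%.3f  L[1][0]=%.3f\n", A[0][0], A[1][0]);
    return 0;
}
```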