Proceedings of the 5th Conference on Computing Frontiers 2008
DOI: 10.1145/1366230.1366256
Optimizing thread throughput for multithreaded workloads on memory constrained CMPs

Cited by 6 publications (8 citation statements)
References 21 publications
“…Many high-performance scientific and commercial workloads running on shared-memory systems use time-sharing of programs, gang-scheduling their respective threads (in an effort to obtain best performance from less thrashing and fewer conflicts for shared resources). This scheduling policy thus provides the baseline for previous studies [3,24,6]. In contrast, we discover that for several multithreaded programs better performance results from space-sharing rather than time-sharing the CMP.…”
Section: Introduction
confidence: 83%
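The space-sharing alternative described in this excerpt can be made concrete with a minimal C sketch (not taken from the paper): each co-scheduled program pins its threads to a disjoint subset of cores through the Linux CPU-affinity API instead of time-sharing all cores. The 8-core partition and the helper name pin_to_core_range are illustrative assumptions.

/* Minimal sketch (illustrative, not from the cited paper): space-share a
 * CMP by pinning this program's threads to a disjoint subset of cores,
 * rather than gang-scheduling two programs across all cores in time slices. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Restrict the calling thread (and threads it creates later, which inherit
 * the mask) to cores [first_core, last_core]. Returns 0 on success. */
static int pin_to_core_range(int first_core, int last_core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first_core; c <= last_core; c++)
        CPU_SET(c, &set);
    return sched_setaffinity(0 /* calling thread */, sizeof(set), &set);
}

int main(void)
{
    /* Example partition on an 8-core CMP: this program takes cores 0-3,
     * while a co-scheduled program would be pinned to cores 4-7. */
    if (pin_to_core_range(0, 3) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... spawn worker threads here; they inherit the affinity mask ... */
    return 0;
}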
“…Isci et al [13] and Herbert et al [12] examine scaling frequency when the processor is constrained by memory bottlenecks. Bhadauria and McKee [3] find that memory constraints often make the optimal thread count smaller than the total number of processors on a CMP. Curtis-Maury et al [8] predict efficient concurrency levels for parallel regions of multithreaded programs.…”
Section: Thread Scaling
confidence: 99%
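The observation that the best thread count may be below the core count on a memory-constrained CMP suggests a simple empirical sweep. The OpenMP sketch below is illustrative only (not the algorithm of any cited paper): it times a bandwidth-heavy kernel at each thread count from 1 to the number of processors and reports the count with the highest throughput.

/* Minimal sketch (illustrative): sweep the thread count and keep the one
 * with the highest measured throughput on a bandwidth-heavy kernel. */
#include <omp.h>
#include <stdio.h>

#define N (1 << 23)               /* array length; sized to exceed caches */
static double a[N], b[N];

static void kernel(void)          /* stand-in for one unit of real work */
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + a[i];
}

int main(void)
{
    int best_threads = 1;
    double best_rate = 0.0;

    kernel();                     /* warm-up: touch the pages before timing */

    for (int t = 1; t <= omp_get_num_procs(); t++) {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        kernel();
        double rate = N / (omp_get_wtime() - start);  /* elements per second */
        printf("%2d threads: %.3g elements/s\n", t, rate);
        if (rate > best_rate) { best_rate = rate; best_threads = t; }
    }
    printf("best thread count: %d\n", best_threads);
    return 0;
}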
“…Similarly, Kunal et al [20] proposed an adaptive scheduling algorithm based on feedback about the parallelism in the application. Many other works that dynamically control the number of threads are aimed at studying power-performance trade-offs [11]-[13], [25], [27]. Unlike the above, Barnes et al [33] presented regression techniques to predict parallel program scaling behavior (processor count).…”
Section: Related Work
confidence: 99%
“…However, off-chip memory bandwidth is considered a fixed resource and is not expected to increase with the core count. Because of this, many data-parallel applications become memory-bandwidth limited and show poor performance scaling with increasing thread counts [7], [8]. Once the off-chip bus reaches its bandwidth limit, performance flattens sharply or decreases rapidly as the number of threads increases.…”
Section: Introduction
confidence: 99%
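The flattening described above follows from treating off-chip bandwidth as a fixed ceiling. The short C sketch below uses assumed numbers (3 GB/s of demand per thread, a 12 GB/s off-chip peak), not measurements from the cited work, to show delivered bandwidth scaling linearly with threads until aggregate demand reaches the ceiling, after which it flattens.

/* Minimal sketch with assumed numbers: a roofline-style view of why
 * throughput stops scaling once aggregate memory demand hits the fixed
 * off-chip bandwidth. */
#include <stdio.h>

int main(void)
{
    const double per_thread_gbs   = 3.0;   /* assumed demand per thread (GB/s) */
    const double peak_offchip_gbs = 12.0;  /* assumed off-chip peak (GB/s)     */

    for (int threads = 1; threads <= 8; threads++) {
        double demand    = threads * per_thread_gbs;
        double delivered = demand < peak_offchip_gbs ? demand : peak_offchip_gbs;
        /* Delivered bandwidth acts as a throughput proxy: linear up to
         * 4 threads here, flat at the off-chip limit beyond that. */
        printf("%d threads: demand %.1f GB/s, delivered %.1f GB/s\n",
               threads, demand, delivered);
    }
    return 0;
}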