Bottleneck identification and scheduling in multithreaded applications

Joao, José A.; Suleman, Muhammad; Mutlu, Onur; Patt, Yale N.

doi:10.1145/2248487.2151001

Cited by 11 publications

(14 citation statements)

References 39 publications


“…T_parallel is the parallel portion, which can be eaten away by spawning more threads. For online usage, T_serial and T_parallel can be accurately obtained by loop peeling method [12] that finishes executions repeatedly and learns the ratio of the serial portion, or, instrumentation technique [24] that inserts bottleneck-identification instructions at the entry and exit of serial and parallel portions to record elapsed cycles. As the distinctive part of α model, multithreading overhead T_penalty indeed reveals the bottleneck of scale-out speedup [24].…”

Section: Scale-out Speedup: α (mentioning)

confidence: 99%

“…For online usage, T_serial and T_parallel can be accurately obtained by loop peeling method [12] that finishes executions repeatedly and learns the ratio of the serial portion, or, instrumentation technique [24] that inserts bottleneck-identification instructions at the entry and exit of serial and parallel portions to record elapsed cycles. As the distinctive part of α model, multithreading overhead T_penalty indeed reveals the bottleneck of scale-out speedup [24]. It is determined by 1) synchronization contentions, such as inter-thread locks and barriers, and 2) communication contentions, happened on communication-related hardware resources of LLC, memory controller, and memory bus etc.…”

Section: Scale-out Speedup: α (mentioning)

confidence: 99%

“…We choose applications with largest amounts of lock and barrier synchronization. No matter what type of synchronization happens, the penalty can be tracked and returned by the bottleneck-identifying instructions (BottleneckCall, BottleneckReturn and BottleneckWait) already used in [24]. Basically, the penalty correlates linearly with thread number, even though with different slope owing to different intensity of locks and barriers.…”

Section: Scale-out Speedup: α (mentioning)

confidence: 99%
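The statements above name three time components (T_serial, T_parallel, T_penalty) and note that the penalty grows roughly linearly with the thread count. A minimal sketch of how such a model could be evaluated follows. The exact functional form is not given in the quoted text, so the Amdahl-style combination, the function names, and the `penalty_slope` parameter are all assumptions made for illustration, not the cited paper's α model.

```python
def scale_out_time(t_serial, t_parallel, threads, penalty_slope):
    """Estimated execution time for a given thread count (hypothetical form)."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # Assumption: the synchronization penalty is linear in the thread count
    # and is zero when only one thread runs.
    t_penalty = penalty_slope * (threads - 1)
    # Assumption: the parallel portion divides evenly among the threads.
    return t_serial + t_parallel / threads + t_penalty


def scale_out_speedup(t_serial, t_parallel, threads, penalty_slope):
    """Speedup relative to single-threaded execution under the same assumptions."""
    baseline = scale_out_time(t_serial, t_parallel, 1, penalty_slope)
    return baseline / scale_out_time(t_serial, t_parallel, threads, penalty_slope)
```

Under these assumptions, a larger slope (more intense locking or barrier use) makes the speedup peak at a smaller thread count, which is one way to read the "different slope" remark in the quoted statement.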

“…In α model, SPKI (k_1) represents the accumulated synchronization-induced waiting cycles per kiloinstructions. It is caused by both locks and barriers, obtained by the bottleneck-identifying instructions built in source code [24]. In each interval, BottleneckCall and BottleneckReturn identify the regions of the locks and barriers, then BottleneckWait accumulates the waiting penalty of each thread.…”

Section: Model Implementation (mentioning)

confidence: 99%
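The SPKI metric described in the statement above is a ratio of waiting cycles to kiloinstructions. A minimal sketch of the arithmetic follows. The quoted text does not say whether the instruction count is per thread or aggregated over the interval, so the aggregate count used here is an assumption, and the function and parameter names are hypothetical.

```python
def spki(waiting_cycles_per_thread, instructions_retired):
    """Synchronization-induced waiting cycles per kiloinstruction for one interval.

    waiting_cycles_per_thread: per-thread waiting cycles accumulated in the interval
    instructions_retired: total instructions in the same interval (assumed aggregate)
    """
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    total_wait = sum(waiting_cycles_per_thread)
    return total_wait / (instructions_retired / 1000.0)
```

For example, two threads that waited 1000 and 3000 cycles during an interval of 2,000,000 instructions give 4000 / 2000 = 2.0 waiting cycles per kiloinstruction.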


An Analytical Framework for Estimating Scale-Out and Scale-Up Power Efficiency of Heterogeneous Manycores

Yan, Han, et al. 2016. IEEE Trans. Comput.


Heterogeneous manycore architectures have been shown to be highly promising to boost power efficiency through two independent ways: 1) enabling massive thread-level parallelism, called the "scale-out" approach, and 2) enabling thread migration between heterogeneous cores, called the "scale-up" approach. How to accurately model the profitability of power efficiency of the two ways, particularly in an analytical and computationally efficient manner, is essential to reap the power efficiency of such architectures. We propose a comprehensive analytical model to predict the power efficiency from the two independent ways. Given that power efficiency is measured by performance per watt, this model is composed of a performance model and a power model. The performance model is built from two orthogonal functions, α and β. Function α describes the scale-out speedup from multithreading; function β presents the scale-up speedup from core heterogeneity. Thus, the performance model can clearly capture the overall speedup of any multithreading and thread-to-core mapping strategies. The power model predicts the power of corresponding scale-out and scale-up configurations. It simultaneously captures the power variations caused by thread synchronization and thread migration between heterogeneous cores. We build both the performance and power models in an analytical way and keep the computational complexity in mind. This merit leads to a suite of comprehensive and low-complexity models for runtime management. These models are validated on large-scale heterogeneous manycore architecture with full-system simulations. For performance prediction, the average error is below 12%, lower than that of the state-of-the-art methods. For power prediction, the average error is 7.74%. On top of the models, we introduce two heuristic scheduling algorithms, performance-oriented MAX-P and power efficiency-oriented MAX-E, to demonstrate the usage of these models. The results show that MAX-P outperforms the state-of-the-art methods by 18% in performance on average; MAX-E outperforms the baseline by 70% in power efficiency on average.


“…First, the scheduling unit in existing techniques is either interval based (fixed-instruction interval [70,72,73,78,81] or fixed-time interval [57,65,79,88,89]) or a code segment (e.g., critical sections, lagging threads, application kernels [54,68,69,87]). The scheduling unit in the event-based scheduling is the event handler in interactive mobile Web applications.…”

Section: Related Work (mentioning)

confidence: 99%

Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications

Zhu, Halpern, Reddi. 2015. 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)


Mobile Web applications have become an integral part of our society. They pose a high demand for application quality of service (QoS). However, the energy-constrained nature of mobile devices makes optimizing for QoS difficult. Prior art on energy efficiency optimizations has only focused on the trade-off between raw performance and energy consumption, ignoring the application QoS characteristics. In this paper, we propose the concept of energy-efficient QoS (eQoS) to capture the trade-off between QoS and energy consumption. Given the fundamental event-driven nature of mobile Web applications, we further propose event-based scheduling as an optimization framework for eQoS. The event-based scheduling automatically reasons about users' QoS requirements, and accurately slacks the events' execution time to save energy without violating end users' experience. We demonstrate a working prototype using the Google Chromium and V8 framework on the Samsung Exynos 5410 SoC (used in the Galaxy S4 smartphone). Based on real hardware and software measurements, we achieve 41.2% energy saving with only 0.4% of QoS violations perceptible to end users.


Time Donating Barrier for efficient task scheduling in competitive multicore systems

Peng, Jin. 2016. Future Generation Computer Systems


Nowadays, co-locating multithreaded applications on a multicore system has increasingly become a common case in cloud data centers, where multiple threads generally compete for computing resources. These competitive environments may suffer problems of system throughput and fairness caused by barrier operations in multithreaded applications. This is because most implementations of the barrier synchronization are based on the spin-then-block mechanism in which spinning-waiting threads probably waste computing resources and relinquish cores to other co-running applications after they are blocked. This paper wants to find a new and intuitive way to improve the efficiency of barrier in competitive environments, and answer the question: Can we leverage the timeslices of waiting threads to accelerate barrier operations? Targeting this question, we propose a novel barrier synchronization mechanism named Tidon (Time Donating Barrier). The basic idea of Tidon is to donate the timeslices of waiting threads to their preempted, laggard siblings in order to accelerate barrier operations, different from traditional static spinning and blocking. We implement Tidon based on the GNU OpenMP runtime library (libgomp) and Linux kernel with new, lightweight system calls. Our evaluation with various sets of co-running applications demonstrates that the advantages of Tidon include (1) alleviating the performance degradation of barrier-intensive applications (e.g. improving the performance by up to a factor of 17.9 and 2.3 compared to the default barrier implementation of OpenMP in Completely Fair Scheduler and Balance
