Timothy Creech scite author profile

Timothy Creech

4Publications

24Citation Statements Received

45Citation Statements Given

How they've been cited

How they cite others

Affiliations

University of Maryland, College Park

Publications

Order By: Most citations

Efficient multiprogramming for multicores with SCAF

Creech

Kotha

Barua

2013

View full text Add to dashboard Cite

As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run [1], enable sophisticated and flexible resource management.Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed run-time environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend upon profiling applications ahead of time in order to make good decisions about allocations, or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution which supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation.In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions.We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme.We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning -the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of equipartitioning (i.e., improvement factor is 0.95X < improvement factor in sum of speedups < 1.05X), then we deem SCAF to break even. Less than 0.95X is considered a slowdown; greater than 1.05X is an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for benchmark pairs for which it shows an improvement, depending on the machine.Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end-users today. SCAF improves or breaks even on the unmodified OpenMP runtimes for all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.

show abstract

Affine Parallelization of Loops with Run-Time Dependent Bounds from Binaries

Kotha

Anand

Creech

et al. 2014

View full text Add to dashboard Cite

An automatic parallelizer is a tool that converts serial code to parallel code. This is an important tool because most hardware today is parallel and manually rewriting the vast repository of serial code is tedious and error prone. We build an automatic parallelizer for binary code, i.e. a tool which converts a serial binary to a parallel binary. It is important because: (i) most serial legacy code has no source code available; (ii) it is compatible with all compilers and languages. In the past binary automatic parallelization techniques have been developed and researchers have presented results on small kernels from polybench. These techniques are a good start; however they are far from parallelizing larger codes from the SPEC2006 and OMP2001 benchmark suites which are representative of real world codes. The main limitation of past techniques is the assumption that loop bounds are statically known to calculate loop dependencies. However, in larger codes loop bounds are only known at run-time; hence loop dependencies calculated statically are overly conservative making binary parallelization ineffective. In this paper we present a novel algorithm that enhancing past techniques significantly by guessing the most likely loop bounds using only the memory expressions present in that loop. It then inserts run-time checks to see if these guesses were indeed correct and if correct executes the parallel version of the loop, else the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001 and unlike previous methods the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Among the 8 affine benchmarks among these suites, the best existing binary parallelization method achieves an average speedup of 1.74X, whereas our method achieves a speedup of 3.38X. This is close to the speedup from source code of 3.15X.

show abstract

Transparently Space Sharing a Multicore Among Multiple Processes

Creech

Barua

2016

ACM Trans. Parallel Comput.

View full text Add to dashboard Cite

As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.

show abstract

Affine Parallelization Using Dependence and Cache Analysis in a Binary Rewriter

Kotha

Anand

Creech

et al. 2015

IEEE Trans. Parallel Distrib. Syst.

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Timothy Creech

Efficient multiprogramming for multicores with SCAF

Affine Parallelization of Loops with Run-Time Dependent Bounds from Binaries

Transparently Space Sharing a Multicore Among Multiple Processes

Affine Parallelization Using Dependence and Cache Analysis in a Binary Rewriter

Contact Info

Product

Resources

About