Christos D. Antonopoulos scite author profile

Phylogenetic inference is considered to be one of the grand challenges in Bioinformatics due to its immense computational requirements. RAxML is currently among the fastest and most accurate programs for phylogenetic tree inference under the Maximum Likelihood (ML) criterion. First, we introduce new tree search heuristics that accelerate RAxML by a factor of 2.43 while returning equally good trees. The performance of the new search algorithm has been assessed on 18 real-world datasets ranging from 148 to 4,843 DNA sequences. We then present the implementation, optimization, and evaluation of RAxML on the IBM Cell Broadband Engine. We address the problems and provide solutions pertaining to the optimization of floating point code, control flow, communication, and scheduling of multi-level parallelism on the Cell.

Online power-performance adaptation of multithreaded programs using hardware event-based prediction

Curtis-Maury, Dzierwa, Antonopoulos, et al. 2006

110

With high-end systems featuring multicore/multithreaded processors and high component density, power-aware high-performance multithreading libraries become a critical element of the system software stack. Online power and performance adaptation of multithreaded code from within user-level runtime libraries is a relatively new and unexplored area of research. We present a user-level library framework for nearly optimal online adaptation of multithreaded codes for low-power, high-performance execution. Our framework operates by regulating concurrency and changing the processors/threads configuration as the program executes. It is innovative in that it uses fast, runtime performance prediction derived from hardware event-driven profiling to select thread granularities that achieve nearly optimal energy-efficiency points. The use of predictors substantially reduces the runtime cost of granularity control and program adaptation. Our framework achieves performance and ED² (energy-delay-squared) levels which are: i) comparable to or better than those of oracle-derived offline predictors; ii) significantly better than those of online predictors using exhaustive or localized linear search. The complete prediction and adaptation framework is implemented on a real multi-SMT system with Intel Hyperthreaded processors and embeds adaptation capabilities in OpenMP programs.
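
The ED² metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition ED² = E × D², with energy E = P × D. The configuration names and the predicted power and time values below are hypothetical and are not taken from the paper; the function names are also illustrative.

```python
from typing import Dict, Tuple


def ed2(power_watts: float, time_s: float) -> float:
    """Energy-delay-squared: E * D^2, where E = P * D."""
    energy_j = power_watts * time_s
    return energy_j * time_s ** 2


def pick_configuration(predictions: Dict[str, Tuple[float, float]]) -> str:
    """Return the configuration whose predicted ED^2 is lowest."""
    return min(predictions, key=lambda name: ed2(*predictions[name]))


# Hypothetical predictions: configuration -> (average power in W, time in s).
predictions = {
    "1 thread": (60.0, 10.0),
    "2 threads": (80.0, 6.0),
    "4 threads": (120.0, 5.5),
}

best = pick_configuration(predictions)
print(best)
```

With these made-up numbers the two-thread configuration wins: four threads finish slightly sooner, but the extra power outweighs the shorter delay even though ED² weights delay quadratically.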

Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes

Curtis-Maury, Blagojević, Antonopoulos, et al. 2008

IEEE Trans. Parallel Distrib. Syst.

Computing has recently reached an inflection point with the introduction of multi-core processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores; however, in several domains users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system which uses live program analysis to optimize applications dynamically. We describe a dynamic, phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8% simultaneous with an improvement in performance of 17.9%, resulting in energy savings of 26.7%.
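
The relationship between the three percentages in this abstract can be sketched from the identity energy = power × time. The snippet below assumes that the 17.9% performance improvement is read as a 17.9% reduction in execution time; that reading is an assumption, and the function name is illustrative.

```python
def energy_savings(power_reduction: float, time_reduction: float) -> float:
    """Fractional energy saved when average power and execution time both drop.

    Energy = power * time, so the remaining fractions multiply.
    """
    remaining = (1.0 - power_reduction) * (1.0 - time_reduction)
    return 1.0 - remaining


# Figures quoted in the abstract, read as fractional reductions (assumption).
savings = energy_savings(power_reduction=0.108, time_reduction=0.179)
print(f"{savings:.1%}")
```

Under this reading the product gives roughly 26.8%, close to the reported 26.7%. A small gap of this kind would be expected if the published figures are averages over several benchmarks, since averages of products and products of averages generally differ.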

Synthesis of Platform Architectures from OpenCL Programs

Owaida, Bellas, Daloukas, et al. 2011

Scalable locality-conscious multithreaded memory allocation

Schneider, Antonopoulos, Nikolopoulos 2006

We present Streamflow, a new multithreaded memory manager designed for low overhead, high-performance memory allocation while transparently favoring locality. Streamflow enables low overhead simultaneous allocation by multiple threads and adapts to sequential allocation at speeds comparable to that of custom sequential allocators. It favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing, all using a unified design based on segregated heaps. Streamflow introduces an innovative design which uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization in the less common case of remote deallocations. Spatial locality at the cache and page level is favored by eliminating small object headers, reducing allocator-induced conflicts via contiguous allocation of page blocks in physical memory, reducing allocator-induced false sharing by using segregated heaps, and achieving better TLB performance and fewer page faults via the use of superpages. Combining these locality optimizations with the drastic reduction of synchronization and latency overhead allows Streamflow to perform comparably with optimized sequential allocators and, on a shared-memory system with four two-way SMT processors, to outperform four state-of-the-art multiprocessor allocators by sizeable margins in our experiments. The allocation-intensive sequential and parallel benchmarks used in our experiments represent a variety of behaviors, including mostly local object allocation-deallocation patterns and producer-consumer allocation-deallocation patterns.

Dynamic multigrain parallelization on the cell broadband engine

Blagojević, Nikolopoulos, Stamatakis, et al. 2007

This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods that selectively exploit the layers of parallelism in the system in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.

GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates

Parasyris, Tziantzoulis, Antonopoulos, et al. 2014

Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors

McGregor, Antonopoulos, Nikolopoulos

With the latest high-end computing nodes combining shared-memory multiprocessing with hardware multithreading, new scheduling policies are necessary for workloads consisting of multithreaded applications. The use of hybrid multiprocessors presents schedulers with the problem of job pairing, i.e. deciding which specific jobs can share each processor with minimum performance penalty, by running on different execution contexts. Therefore, scheduling policies are expected to decide not only which job mix will execute simultaneously across the processors, but also which jobs can be combined within each processor. This paper addresses the problem by introducing new scheduling policies that use run-time performance information to identify the best mix of threads to run across processors and within each processor. Scheduling of threads across processors is driven by the memory bandwidth utilization of the threads, whereas scheduling of threads within processors is driven by one of three metrics: bus transaction rate per thread, stall cycle rate per thread, or outermost level cache miss rate per thread. We have implemented and experimentally evaluated these policies on a real multiprocessor server with Intel Hyperthreaded processors. The policy using bus transaction rate for thread pairing achieves an average 13.4% and a maximum 28.7% performance improvement over the Linux scheduler. The policy using stall cycle rate for thread pairing achieves an average 9.5% and a maximum 18.8% performance improvement. The average and maximum performance gains of the policy using cache miss rate for thread pairing are 7.2% and 23.6% respectively.
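
The idea of pairing threads within a processor by a per-thread metric can be sketched as follows. This is only one plausible heuristic, not necessarily the policy evaluated in the paper: it pairs the thread with the highest bus transaction rate with the one with the lowest, so that no processor hosts two bus-hungry threads. The thread names and rates are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_threads(rates: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair the most demanding remaining thread with the least demanding one.

    `rates` maps a thread name to its measured bus transaction rate.
    With an odd number of threads, the median thread is left unpaired.
    """
    ordered = sorted(rates, key=rates.get)  # ascending by rate
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[hi], ordered[lo]))
        lo += 1
        hi -= 1
    return pairs


# Hypothetical per-thread bus transaction rates (arbitrary units).
rates = {"A": 9.0, "B": 1.0, "C": 6.0, "D": 3.0}
pairs = pair_threads(rates)
print(pairs)
```

Here thread A (highest rate) shares a processor with B (lowest), and C shares with D, giving combined rates of 10.0 and 9.0. The same function could be driven by stall cycle rate or cache miss rate, the other two metrics named in the abstract.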
