A discussion of many of the recently implemented features of GAMESS (General Atomic and Molecular Electronic Structure System) and LibCChem (the C++ CPU/GPU library associated with GAMESS) is presented. These features include fragmentation methods such as the fragment molecular orbital, effective fragment potential, and effective fragment molecular orbital methods; hybrid MPI/OpenMP approaches to Hartree–Fock; and resolution-of-the-identity second-order perturbation theory. Many new coupled cluster theory methods have been implemented in GAMESS, as have multiple levels of density functional/tight-binding theory. The role of accelerators, especially graphics processing units, is discussed in the context of the new features of LibCChem, as is the associated problem of power consumption as the power of computers increases dramatically. The process by which a complex program suite such as GAMESS is maintained and developed is considered. Future developments are briefly summarized.
We describe a scalable and general-purpose framework for auto-tuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the best-performing one. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than that of the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of search steps.
Abstract. In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, to handle tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on the fly. Effectively, we merge traditional feedback-directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.
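The variant-evaluation idea above can be sketched in a few lines: generate one code variant per candidate parameter value, measure each (in the real system, compiled on the fly and timed on separate nodes), and keep the fastest. This is a minimal illustration, not the Active Harmony implementation; `measure` and the runtime numbers are hypothetical stand-ins for compiling and timing real variants.

```python
def pick_best_variant(unroll_factors, measure):
    """Evaluate one code variant per tunable value and keep the fastest.

    measure(u) stands in for: generate code with unroll factor u,
    compile it, run it, and return the measured wall-clock time.
    """
    timings = {u: measure(u) for u in unroll_factors}  # one run per variant
    best = min(timings, key=timings.get)               # lowest runtime wins
    return best, timings[best]

# Stand-in cost model: pretend unroll factor 4 happens to run fastest.
fake_runtimes = {1: 2.0, 2: 1.5, 4: 1.1, 8: 1.3}
best, runtime = pick_best_variant(fake_runtimes, fake_runtimes.__getitem__)
```

A real search would not exhaustively measure every value; Active Harmony's parallel search explores only a subset of the space, which is how it converges in a few tens of steps.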
Abstract. We examine the scalability of a set of techniques related to Dynamic Voltage-Frequency Scaling (DVFS) on HPC systems to reduce the energy consumption of scientific applications through an application-aware analysis and runtime framework, Green Queue. Green Queue supports making CPU clock frequency changes in response to intra-node and inter-node observations about application behavior. Our intra-node approach reduces CPU clock frequencies, and therefore power consumption, while CPUs lack computational work due to inefficient data movement. Our inter-node approach reduces clock frequencies for MPI ranks that lack computational work. We investigate these techniques on a set of large scientific applications on 1024 cores of Gordon, an Intel Sandy Bridge-based supercomputer at the San Diego Supercomputer Center. Our optimal intra-node technique showed an average measured energy savings of 10.6% and a maximum of 21.0% over regular application runs. Our optimal inter-node technique showed an average of 17.4% and a maximum of 31.7% energy savings.
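The intra-node idea, lowering the clock when the CPU is mostly waiting on data movement, can be illustrated with a toy decision rule. This is not Green Queue's actual policy; the thresholds and frequency table below are hypothetical, chosen only to show the shape of the mapping from observed memory-boundedness to a frequency setting.

```python
def pick_frequency(memory_bound_fraction, freqs_ghz=(2.6, 2.0, 1.6)):
    """Map a phase's measured memory-boundedness to a CPU clock frequency.

    memory_bound_fraction: fraction of cycles stalled on data movement
    (a hypothetical metric; a real framework derives this from hardware
    performance counters). Thresholds below are illustrative only.
    """
    if memory_bound_fraction > 0.7:   # mostly waiting on memory
        return freqs_ghz[-1]          # lowest frequency: save power
    if memory_bound_fraction > 0.4:   # mixed compute / data movement
        return freqs_ghz[1]
    return freqs_ghz[0]               # compute-bound: full speed
```

The inter-node analogue applies the same logic per MPI rank, slowing ranks that would otherwise idle while waiting for slower peers.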
Abstract. The power wall has become a dominant impeding factor in the realm of exascale system design. It is therefore important to understand how to most effectively create application software in order to minimize its power usage while maintaining satisfactory levels of performance. In this work, we use existing software and hardware facilities to tune applications to minimize several combined power and performance objectives. The tuning is done with respect to software-level performance-related tunables (cache tiling factors and loop unrolling factors) as well as processor clock frequency. These tunable parameters are explored via an offline search in order to find the parameter combinations that are optimal with respect to performance (or delay, D), energy (E), energy×delay (E×D), and energy×delay×delay (E×D²). These searches are employed on a parallel application that solves Poisson's equation using stencil computations. Stencil (nearest-neighbor) computations are very common operations in today's scientific applications. We show that the parameter configuration that minimizes energy consumption can save, on average, 5.4% energy with a performance loss of 4% when compared to the configuration that minimizes runtime. Furthermore, with the work presented in this paper, we provide evidence for the existence of opportunities to auto-tune for energy in parallel applications.
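The four objectives named above are simple functions of a configuration's runtime and average power draw, and different objectives can favor different configurations. The sketch below computes D, E, E×D, and E×D² for two invented (runtime, power) measurements; the numbers are hypothetical, chosen so that the energy-optimal configuration is slightly slower than the delay-optimal one, mirroring the trade-off the abstract reports.

```python
def objectives(runtime_s, avg_power_w):
    """Compute the four tuning objectives for one configuration."""
    d = runtime_s
    e = runtime_s * avg_power_w  # energy in joules (power * time)
    return {"D": d, "E": e, "ExD": e * d, "ExD2": e * d * d}

# Hypothetical measurements for two parameter configurations.
configs = {
    "fastest":   (10.0, 120.0),  # minimal runtime, higher power draw
    "low-power": (10.4, 110.0),  # 4% slower, but lower power draw
}

best_for_D = min(configs, key=lambda c: objectives(*configs[c])["D"])
best_for_E = min(configs, key=lambda c: objectives(*configs[c])["E"])
```

Here the "low-power" configuration uses 1144 J versus 1200 J for "fastest", so minimizing E picks a different point than minimizing D; E×D and E×D² interpolate between the two extremes by penalizing slowdowns progressively more.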
The current state of practice in supercomputer resource allocation places jobs from different users on disjoint nodes, both in time and in space. While this approach largely guarantees that jobs from different users do not degrade one another's performance, it does so at a high cost to system throughput and energy efficiency. This focused study presents job striping, a technique that significantly increases performance over the current allocation mechanism by colocating pairs of jobs from different users on a shared set of nodes. To evaluate the potential of job striping in large-scale environments, the experiments are run at the scale of 128 nodes on the state-of-the-art Gordon supercomputer. Across all pairings of 1024-process NAS Parallel Benchmarks (NPBs), job striping increases mean throughput by 26% and mean energy efficiency by 22%. On pairings of the real applications Gyrokinetic Toroidal Code (GTC), Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and MIMD Lattice Computation (MILC) at equal scale, job striping improves average throughput by 12% and mean energy efficiency by 11%. In addition, the study provides a simple set of heuristics for avoiding low-performing application pairs.

Figure 3. Increase in system throughput (STP) over compact when applying job spreading and striping to the NAS Parallel Benchmarks and GTC, LAMMPS, and MILC.

Figure 3 shows the performance results for the first set of experiments with the NPBs and the second set of experiments with GTC, LAMMPS, and MILC. For the NPBs, the mean performance increase from job spreading is 50%. If one examines striped coschedules of non-identical NPBs, the average performance increase is 26%. If one selects the best running mate other than embarrassingly parallel (EP) for each benchmark, then the average increase in performance is 36%. We choose to exclude EP because EP is minimally contentious.
Each EP task's working set fits entirely in the private levels of cache, and EP spends very little time in active communication. Because of these traits, EP universally causes every application that it stripes with to achieve its best striped performance. Thus, for the sake of fairness and realism, we exclude these results from the 'Best' average. For the NPBs, random striping yields about 50% of the performance benefit of job spreading, and striping each job with its best running mate provides 70% of the performance benefit of spreading. This trend continues for real applications as well. For GTC, LAMMPS, and MILC, job spreading increases throughput by 23%, and mean heterogeneous striping and mean best striping improve performance by 12% and 16%, respectively.

PERFORMANCE RESULTS
Compact versus spreading versus striping: NAS Parallel Benchmarks
In this section, we examine the increase in collective throughput and energy efficiency for pairs of striped NPBs. The results are presented in Figure 4. For completeness, we run all pairwise combinations. This includes both heterogeneous pairings ...
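The system throughput (STP) metric behind these comparisons can be illustrated concretely. A common definition (assumed here; the paper may differ in detail) sums each coscheduled job's speedup relative to running alone, so a pair of jobs on disjoint nodes scores 2.0, and a striped pair scoring above 2.0 has improved collective throughput. The runtimes below are invented for illustration.

```python
def stp(solo_times, colocated_times):
    """System throughput: sum of per-job speedups versus running alone.

    For a pair of jobs, STP > 2.0 means the colocation (striping)
    outperformed the compact, disjoint-node baseline.
    """
    return sum(s / c for s, c in zip(solo_times, colocated_times))

# Two jobs that each take 100 s when run compactly on their own nodes;
# striped together across the union of those nodes, they finish in
# 85 s and 95 s respectively (hypothetical numbers).
pair_stp = stp([100.0, 100.0], [85.0, 95.0])
```

In this invented example the striped pair reaches an STP of about 2.23, i.e. roughly an 11% throughput gain over compact scheduling, which is the kind of improvement Figure 3 reports for real application pairs.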