A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning

We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Deep500 is customizable (enables combining and benchmarking different deep learning codes) and fair (uses carefully selected metrics). Moreover, Deep500 is fast (incurs negligible overheads), verifiable (offers infrastructure to analyze correctness), and reproducible. Finally, as the first distributed and reproducible benchmarking system for deep learning, Deep500 provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.


A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations

Ziogas

Ben-Nun

Fernández

et al. 2019


The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver, capable of revealing the coupled electrothermal properties of atomically-resolved nano-transistors, has been improved by up to two orders of magnitude through a data-centric reorganization of the application. The approach yields coarse- and fine-grained data-movement characteristics that can be used for performance and communication modeling, communication avoidance, and dataflow transformations. The resulting code has been tuned for two top-6 hybrid supercomputers, reaching a sustained performance of 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision. These computational achievements enable the restructured QT simulator to treat realistic nanoelectronic devices made of more than 10,000 atoms within a 14× shorter duration than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy.


On the parallel I/O optimality of linear algebra kernels

Kwasniewski

Kabic

Ben-Nun

et al. 2021


Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating $N^3/(P\sqrt{M})$ elements per processor, where $M$ is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 524,288 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
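To get a feel for the communication volume quoted in the abstract, the short sketch below evaluates only the leading-order term N^3/(P√M). It assumes N is the matrix dimension and P the number of processors (the abstract defines only M), and it ignores constant factors and lower-order terms. The function name and the sample values are hypothetical and are not taken from the paper or its library.

```python
import math


def per_processor_io_volume(n: int, p: int, m: int) -> float:
    """Leading-order term N^3 / (P * sqrt(M)) from the abstract.

    Assumptions (not stated in the abstract): n is the matrix dimension,
    p is the processor count. m is the local memory size in elements.
    Constant factors and lower-order terms are deliberately omitted.
    """
    if n <= 0 or p <= 0 or m <= 0:
        raise ValueError("n, p and m must all be positive")
    return n**3 / (p * math.sqrt(m))


if __name__ == "__main__":
    # Illustrative values only: a 1024 x 1024 matrix, 64 processors,
    # and 2**20 elements of local memory per processor.
    volume = per_processor_io_volume(n=1024, p=64, m=2**20)
    print(f"elements communicated per processor (leading term): {volume:.0f}")
```

Under this term, doubling the processor count halves the per-processor volume, while quadrupling the local memory also halves it, which is the trade-off a 2.5D decomposition exploits.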


A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning

Ben-Nun

Besta

Huber

et al. 2019

Preprint


Stateful dataflow multigraphs

Ben-Nun

Licht

Ziogas

et al. 2019


The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control-flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs, from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.

HPC programmers have long sacrificed ease of programming and portability for achieving better performance. This mindset was established at a time when computer nodes had a single processor/core and were programmed with C/Fortran and MPI. The last decade, witnessing the end of Dennard scaling and Moore's law, brought a flurry of new technologies into the compute nodes. Those range from simple multi-core and manycore CPUs to heterogeneous GPUs and specialized FPGAs. To support those architectures, the complexity of OpenMP's specification grew by more than an order of magnitude from 63 pages in OpenMP 1.0 to 666 pages in OpenMP 5.0. This one example illustrates how (performance) programming complexity shifted from network scalability to node


Optimizing the data movement in quantum transport simulations via data-centric parallel programming

Ziogas

Ben-Nun

Fernández

et al. 2019


Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically resolved electron and phonon transport phenomena from first principles is required. In this paper, we consider a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. The presented results make ab initio device simulation enter a new era, where nanostructures composed of over 10,000 atoms can be investigated at an unprecedented level of accuracy, paving the way for better heat management in next-generation ICs.


Productivity, portability, performance

Ziogas

Schneider

Ben-Nun

et al. 2021


Pebbles, Graphs, and a Pinch of Combinatorics

Kwasniewski

Ben-Nun

Gianinazzi

et al. 2021



Alexandros Nikolaos Ziogas
