Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point where remediation can be difficult. However, creating the analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it for at most a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both the coverage and the speed of this scalability analysis can be substantially improved. By generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will degrade performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
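To make the idea of empirical modeling concrete, the sketch below fits a simple scaling model to hypothetical timings of a single code region at small core counts and extrapolates it to a larger scale. The model form t(p) = c + a·p^b, the timing data, and the target core count are illustrative assumptions, not the actual model search performed by the tool.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative timings (seconds) of one code region at small core counts.
cores = np.array([64, 128, 256, 512, 1024], dtype=float)
time  = np.array([12.1, 12.8, 14.0, 16.1, 19.4])

# Simple candidate scaling model t(p) = c + a * p**b (an assumption for this
# sketch, not the full performance model normal form used in practice).
def model(p, c, a, b):
    return c + a * p**b

(c, a, b), _ = curve_fit(model, cores, time, p0=(10.0, 1e-3, 1.0), maxfev=10000)

# Extrapolating to a much larger machine flags regions whose fitted growth
# term dominates, i.e., potential scalability bugs.
p_target = 131072.0
print(f"fitted exponent b = {b:.2f}")
print(f"predicted time at {int(p_target)} cores: {model(p_target, c, a, b):.1f} s")
```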
Identifying scalability bottlenecks in parallel applications is a vital but also laborious and expensive task. Empirical performance models have proven helpful for finding such limitations, though they require a set of experiments before they can yield valuable insights. The experiment design therefore determines both the quality and the cost of the models. Extra-P is an empirical modeling tool that uses small-scale experiments to assess the scalability of applications. Its current version requires an exponential number of experiments per model parameter. This makes the creation of empirical performance models very expensive, and in some situations even impractical. In this paper, we propose a novel parameter-value selection heuristic that serves as a guideline for the experiment design, leveraging sparse performance modeling, a technique that needs only a polynomial number of experiments per model parameter. Using synthetic analysis and data from three different case studies, we show that our solution reduces the average modeling cost by about 85% while retaining 92% of the model accuracy.
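To illustrate why the experiment design drives the cost, the sketch below contrasts a full factorial design, whose size grows exponentially with the number of model parameters, with a simplified one-parameter-at-a-time design of polynomial size. The parameter names, value lists, and selection rule are illustrative stand-ins, not the heuristic proposed in the paper.

```python
from itertools import product

# Candidate values per model parameter (illustrative).
params = {
    "procs": [64, 128, 256, 512, 1024],   # number of processes
    "size":  [20, 30, 40, 50, 60],        # problem size per process
    "iters": [10, 20, 30, 40, 50],        # solver iterations
}

# Full factorial design: m**k experiments for k parameters with m values each.
full = list(product(*params.values()))

# Sparse design (simplified stand-in for a selection heuristic): vary one
# parameter over all its values while pinning the others to their smallest.
base = {name: values[0] for name, values in params.items()}
sparse = set()
for name, values in params.items():
    for v in values:
        cfg = dict(base, **{name: v})
        sparse.add(tuple(cfg[k] for k in params))

print(f"full factorial: {len(full)} runs, sparse design: {len(sparse)} runs")
# For this 3-parameter, 5-value example: 125 runs vs. 13 runs.
```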
The capacity to offload data and control tasks to the network is becoming increasingly important, especially since network speeds are growing faster than CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining the bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute in which users specify handler functions that are executed on the NIC for each incoming packet belonging to a given message or flow. It enables CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We evaluate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm² (22 nm FDSOI).
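A back-of-the-envelope check, using only the figures quoted above, shows why per-packet handlers must run in parallel at this line rate. Treating the reported 26 ns latency as the per-packet handler occupancy and ignoring header and framing overhead are simplifying assumptions of this estimate.

```python
# Packet-rate budget implied by 400 Gbit/s line rate and 64 B packets.
line_rate_bits = 400e9   # link speed in bit/s
packet_bytes   = 64      # smallest packets considered in the abstract
handler_ns     = 26      # reported per-packet latency, used here as occupancy

packets_per_s = line_rate_bits / (packet_bytes * 8)   # ~781 Mpps
interarrival  = 1e9 / packets_per_s                   # ns between packets (~1.28 ns)
in_flight     = handler_ns / interarrival             # packets processed concurrently

print(f"packet rate: {packets_per_s / 1e6:.0f} Mpps")
print(f"inter-arrival time: {interarrival:.2f} ns")
print(f"handlers needed in flight to sustain line rate: {in_flight:.0f}")
```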