DuctTeip: An efficient programming model for distributed task-based parallel computing

Zafari, Afshin; Larsson, Elisabeth; Tillenius, Martin

doi:10.1016/j.parco.2019.102582

Cited by 18 publications

(16 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We found Dask to provide a good balance between simplicity, portability and performance, and chose to use it to implement the orchestrator that ships with Orchestral. There are however many alternatives available [4], [29], [5], [34], [23], [33], [18], which could potentially help us to push performance for lowlatency requirements on distributed infrastructures. In fact, Orchestral could be used to create a comparative benchmark of all these libraries.…”

Section: Discussionmentioning

confidence: 99%

Orchestral: A Lightweight Framework for Parallel Simulations of Cell-Cell Communication

Coulier

Hellander

2018

2018 IEEE 14th International Conference on E-Science (E-Science)

View full text Add to dashboard Cite

We develop a modeling and simulation framework capable of massively parallel simulation of multicellular systems with spatially resolved stochastic kinetics in individual cells. By the use of operator-splitting we decouple the simulation of reaction-diffusion kinetics inside the cells from the simulation of molecular cell-cell interactions occurring on the boundaries between cells. This decoupling leverages the inherent scale separation in the underlying model to enable highly horizontally scalable parallel simulation, suitable for simulation on heterogeneous, distributed computing infrastructures such as public and private clouds. Thanks to its modular structure, our frameworks makes it possible to couple just any existing single-cell simulation software together with any cell signaling simulator. We exemplify the flexibility and scalability of the framework by using the popular single-cell simulation software eGFRD to construct and simulate a multicellular model of Notch-Delta signaling over OpenStack cloud infrastructure provided by the SNIC Science Cloud.

show abstract

Section: Discussionmentioning

confidence: 99%

Orchestral: A Lightweight Framework for Parallel Simulations of Cell-Cell Communication

Coulier

Hellander

2018

2018 IEEE 14th International Conference on E-Science (E-Science)

View full text Add to dashboard Cite

show abstract

“…Given these limitations on the widely used structured SWAN grid approach, SWAN grids will almost exclusively be deemed as a low spatial computational demand model. Small tasks create a sharp drop in performance via the Intel C++ compiler due to the "work stealing" algorithm, aimed at balancing out the computational load between threads (Zafari et al, 2019) . In this scenario, the threads compete against each other resulting in an unproductive simulation.…”

Section: Methodology and Backgroundmentioning

confidence: 99%

Parallel computing efficiency of SWAN

Rautenbach

Mullarney

Bryan

2020

Preprint

View full text Add to dashboard Cite

Abstract. Effective and accurate ocean and coastal wave predictions are necessary for engineering, safety and recreational purposes. Refining predictive capabilities is increasingly critical to reduce the uncertainties faced with a changing global wave climatology. Simulating WAves in the Nearshore (SWAN) is a widely used spectral wave modelling tool employed by coastal engineers and scientists, including for operational wave forecasting purposes. Fore- and hindcasts can span hours to decades and a detailed understanding of the computational efficiencies is required to design optimized operational protocols and hindcast scenarios. To date, there exists limited knowledge on the relationship between the size of a SWAN computational domain and the optimal amount of parallel computational threads required to execute a simulation effectively. To test this, a hindcast cluster of 28 computational threads (1 node) was used to determine the computation efficiencies of a SWAN model configuration for southern Africa. The model extent and resolution emulate the current operational wave forecasting configuration developed by the South African Weather Service (SAWS). We implemented and compared both OpenMP and the Message Passing Interface (MPI) distributing memory architectures. Three sequential simulations (corresponding to typical grid cell numbers) were compared to various permutations of parallel computations via the speed-up ratio, time saving ratio and efficiency tests. Generally, a computational node configuration of 6 threads produced the most effective computational set-up based on wave hindcasts of one-week duration. The use of more than 20 threads resulted in a decrease in speed-up ratio for the smallest computation domain, owing to the increased sub-domain communication times for limited domain sizes.

show abstract

“…All of the operations in the NESA algorithm are dense matrix-vector products, with the same computational intensity of 2 flop/double. For modern multicore architectures, a computational intensity of 30-40 is needed in order to balance bandwidth capacity and floating point performance, see for example the trade-offs for the Tintin and Rackham systems at UPPMAX, Uppsala University, calculated in [64]. This means that we need to exploit data locality (work on data that is cached locally) in order to overcome bandwidth limitations and scale to the full number of available cores.…”

Section: Specific Properties Of the Nesa Algorithmmentioning

confidence: 99%

“…The ongoing trend in cluster hardware is an increasing number of cores per computational node. When scaling to large numbers of cores, it is hard to fully exploit the computational resources using a pure MPI implementation, due to the rapid increase in the number of inter-node messages with the number of MPI processes for communication heavy algorithms [64]. As is pointed out in [35], a hybrid parallelization with MPI at the distributed level and threads within the computational nodes is more likely to perform well.…”

Section: State Of the Artmentioning

confidence: 99%

“…When performing large scale three-dimensional simulations, it becomes necessary to use distributed computer systems, and hence distributed parallel programming (or a partitioned global address space (PGAS) model). In [64] it was shown that a hierarchical task-parallel programming model was beneficial for the distributed implementation. Larger tasks are communicated between computational nodes, and then split into subtasks that are executed in parallel within each node.…”

Section: Recommendations For a Task Parallel 3-d Implementationmentioning

confidence: 99%

See 1 more Smart Citation

Parallelization of Hierarchical Matrix Algorithms for Electromagnetic Scattering Problems

Larsson¹,

Zafari²,

Righero³

et al. 2019

Lecture Notes in Computer Science

Self Cite

View full text Add to dashboard Cite

Numerical solution methods for electromagnetic scattering problems lead to large systems of equations with millions or even billions of unknown variables. The coefficient matrices are dense, leading to large computational costs and storage requirements if direct methods are used. A commonly used technique is to instead form a hierarchical representation for the parts of the matrix that corresponds to far-field interactions. The overall computational cost and storage requirements can then be reduced to O(N log N). This still corresponds to a large-scale simulation that requires parallel implementation. The hierarchical algorithms are rather complex, both regarding data dependencies and communication patterns, making parallelization non-trivial. In this chapter, we describe two classes of algorithms in some detail, we provide a survey of existing solutions, we show results for a proof-of-concept implementation, and we provide various perspectives on different aspects of the problem. The list of authors is organized into three subgroups, Larsson and Zafari (coordination and proof-of-concept implementation), Righero, Francavilla, Giordanengo, Vipiana, and Vecchi (definition of and expertise relating to the application), Kessler, Ancourt, and Grelck (perspectives and parallel expertise).

show abstract

DuctTeip: An efficient programming model for distributed task-based parallel computing

Cited by 18 publications

References 37 publications

Orchestral: A Lightweight Framework for Parallel Simulations of Cell-Cell Communication

Orchestral: A Lightweight Framework for Parallel Simulations of Cell-Cell Communication

Parallel computing efficiency of SWAN

Parallelization of Hierarchical Matrix Algorithms for Electromagnetic Scattering Problems

Contact Info

Product

Resources

About