cuHinesBatch: Solving Multiple Hines systems on GPUs Human Brain Project*

*This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució

Valero-Lara, Pedro; Martínez-Pérez, Ivan; Peña, Antonio J.; Martorell, Xavier; Sirvent, Raül; Labarta, Jesús

doi:10.1016/j.procs.2017.05.145

Cited by 15 publications

(12 citation statements)

References 7 publications


“…But, with few cells per morphological type, Gaussian elimination suffers from non-contiguous layout of parents relative to a group of nodes. This results in irregular, strided memory accesses and hence poor performance (Valero-Lara et al, 2017). To address this, two alternative node orderings schemes, Interleaved layout and Constant Depth layout, are implemented as illustrated in Figures 6D,E.…”

Section: Optimizations (mentioning)

confidence: 99%

CoreNEURON : An Optimized Compute Engine for the NEURON Simulator

Kumbhar, Hines, Fouriaux, et al. 2019

Front. Neuroinform.


The NEURON simulator has been developed over the past three decades and is widely used by neuroscientists to model the electrical activity of neuronal networks. Large network simulation projects using NEURON have supercomputer allocations that individually measure in the millions of core hours. Supercomputer centers are transitioning to next generation architectures and the work accomplished per core hour for these simulations could be improved by an order of magnitude if NEURON was able to better utilize those new hardware capabilities. In order to adapt NEURON to evolving computer architectures, the compute engine of the NEURON simulator has been extracted and has been optimized as a library called CoreNEURON. This paper presents the design, implementation, and optimizations of CoreNEURON. We describe how CoreNEURON can be used as a library with NEURON and then compare performance of different network models on multiple architectures including IBM BlueGene/Q, Intel Skylake, Intel MIC and NVIDIA GPU. We show how CoreNEURON can simulate existing NEURON network models with 4–7x less memory usage and 2–7x less execution time while maintaining binary result compatibility with NEURON.
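The citation statement above refers to Gaussian elimination on Hines systems, where each node couples only to its parent in the neuron tree. The following is a minimal serial sketch of that elimination, not code from cuHinesBatch or CoreNEURON. The function names `solve_hines` and `residual`, and the array names `parent`, `lower`, `diag`, and `upper`, are assumptions made for illustration; the sketch assumes nodes are numbered so that `parent[i] < i`.

```python
def solve_hines(parent, lower, diag, upper, rhs):
    """Solve a tree-structured (Hines) linear system.

    Assumed layout (illustrative, not taken from any library):
      diag[i]  = A[i][i]
      lower[i] = A[i][parent[i]]   (entry 0 unused)
      upper[i] = A[parent[i]][i]   (entry 0 unused)
    Node 0 is the root and parent[i] < i for every i >= 1.
    """
    n = len(diag)
    d = list(diag)  # work on copies so the inputs stay untouched
    b = list(rhs)

    # Backward sweep, leaves to root: eliminate each upper entry.
    # The write to index p = parent[i] is the indirect access that
    # becomes irregular when parents are not stored contiguously.
    for i in range(n - 1, 0, -1):
        p = parent[i]
        factor = upper[i] / d[i]
        d[p] -= factor * lower[i]
        b[p] -= factor * b[i]

    # Forward sweep, root to leaves: substitute the parent's solution.
    b[0] /= d[0]
    for i in range(1, n):
        p = parent[i]
        b[i] = (b[i] - lower[i] * b[p]) / d[i]
    return b


def residual(parent, lower, diag, upper, rhs, x):
    """Return max |A x - rhs| using the same sparse layout."""
    n = len(diag)
    r = [diag[i] * x[i] - rhs[i] for i in range(n)]
    for i in range(1, n):
        p = parent[i]
        r[i] += lower[i] * x[p]
        r[p] += upper[i] * x[i]
    return max(abs(v) for v in r)


if __name__ == "__main__":
    # Small branching tree: node 0 has children 1 and 4; node 1 has 2 and 3.
    parent = [0, 0, 1, 1, 0]
    lower = [0.0, -1.0, -1.0, -1.0, -1.0]
    upper = [0.0, -1.0, -1.0, -1.0, -1.0]
    diag = [4.0, 4.0, 4.0, 4.0, 4.0]
    rhs = [1.0, 2.0, 3.0, 4.0, 5.0]
    x = solve_hines(parent, lower, diag, upper, rhs)
    print("max residual:", residual(parent, lower, diag, upper, rhs, x))
```

In a batched GPU setting, many such systems would be solved at once; how the `parent`-indexed reads and writes map onto memory is what the layout discussion in the quoted statement is about.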


“…New HPC architectures such as the addition of ubiquitous GPU resources have been a new challenge, requiring new code adaptation with codes such as CoreNEURON and GeNN for single-node GPU neuronal networks. Developing performant algorithms for computing the Hines matrix on GPUs and other vectorize hardware has been an additional hurdle [8], [9]. The development of Arbor [10] has focused on tackling issues of vectorization and emerging hardware architectures by using modern C++ and automated code generation, within an opensource and open-development model.…”

Section: Introduction (mentioning)

confidence: 99%

Arbor — A Morphologically-Detailed Neural Network Simulation Library for Contemporary High-Performance Computing Architectures

Akar, Cumming, Karakasis, et al. 2019

2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)


We introduce Arbor, a performance portable library for simulation of large networks of multi-compartment neurons on HPC systems. Arbor is open source software, developed under the auspices of the HBP. The performance portability is by virtue of back-end specific optimizations for x86 multicore, Intel KNL, and NVIDIA GPUs. When coupled with low memory overheads, these optimizations make Arbor an order of magnitude faster than the most widely-used comparable simulation software. The single-node performance can be scaled out to run very large models at extreme scale with efficient weak scaling.


“…Big efforts have been carried out by the scientific community in order to increase SpMV performance. An important part of the optimization of scientific codes consists of using the appropriate format to represent matrices in memory [21,20,8,25]. Following different approaches, cache performance, data locality and, consequently, the overall performance of SpMV, has been proven to be affected substantially.…”

Section: State of the Art (mentioning)

confidence: 99%

Towards an Auto-Tuned and Task-Based SpMV (LASs Library)

Catalán, Usui, Toledo, et al. 2020

OpenMP: Portable Multi-Level Parallelism on Modern Systems

Self Cite


We present a novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Based on tasking and nesting, with the aim of improving the workload imbalance inherent to the SpMV operation, we present a strategy especially useful for highly imbalanced input matrices. In this approach, the number of created tasks is dynamically decided in order to maximize the use of the resources of the platform. Throughout this paper, SpMV behavior depending on the selected strategy (state of the art and proposed strategies) is deeply analyzed, setting in this way the base for a future auto-tunable code that is able to select the most suitable approach depending on the input matrix. The experiments of this work were carried out for a set of 12 matrices from the Suite Sparse Matrix Collection, all of them with different characteristics regarding their sparsity. The experiments of this work were performed on a node of Marenostrum 4 supercomputer (with two sockets Intel Xeon, 24 cores each) and on a node of Dibona cluster (using one ARM ThunderX2 socket with 32 cores). Our tests show that, for Intel Xeon, the best parallelization strategy reduces the execution time of the reference MKL multi-threaded version up to 67%. On ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference.
