José R. Herrero scite author profile

Parallelizing dense and banded linear algebra libraries using SMPSs

The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a runtime system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run-time. The parallelization of FLAME routines using SMPSs is simpler, as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations, and storage-by-blocks (or block data layout) is already in place.
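The storage conversion mentioned in this abstract can be illustrated with a minimal sketch. It copies a column-major matrix, held as a flat list, into a block data layout in which each tile is stored contiguously. The function name `to_block_layout`, the tile ordering, and the column-major order inside each tile are assumptions made for illustration; they are not taken from the paper, LAPACK, or FLAME.

```python
def to_block_layout(a, m, n, b):
    """Copy an m x n column-major matrix `a` (flat list) into block layout.

    Tiles are b x b (edge tiles may be smaller). Tiles are emitted in
    column-major tile order, and entries inside each tile are also
    column-major. Both choices are illustrative assumptions.
    """
    out = []
    for jb in range(0, n, b):                      # first column of the tile
        for ib in range(0, m, b):                  # first row of the tile
            for j in range(jb, min(jb + b, n)):    # columns inside the tile
                for i in range(ib, min(ib + b, m)):  # rows inside the tile
                    out.append(a[i + j * m])       # column-major index
    return out


# A 4 x 4 matrix whose entry (i, j) holds the value i + 4 * j.
flat = list(range(16))
blocked = to_block_layout(flat, 4, 4, 2)
```

After the conversion, the four entries of every 2 x 2 tile sit next to each other in memory, which is the property a tile-level task would rely on for locality.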

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting

Catalán

Herrero

Quintana‐Ortí

et al. 2019

IEEE Access

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task, and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on a six-core Intel Xeon processor show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.
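The early termination idea can be sketched in a few lines, under loud assumptions: the slower task works through its columns in chunks and checks a shared flag between chunks, and the faster task sets that flag when it finishes. The names `panel_task` and `run_iteration`, the chunk size, and the use of `threading.Event` are hypothetical stand-ins chosen for illustration; the paper instantiates the mechanism inside a malleable BLAS, not with Python threads.

```python
import threading


def panel_task(num_columns, stop_requested, chunk=4):
    """Process columns in chunks; stop early once the flag is set.

    Returns how many columns were completed. The arithmetic on `done`
    stands in for the real work of factorizing `chunk` columns.
    """
    done = 0
    while done < num_columns:
        if stop_requested.is_set():        # the other team has finished
            break
        done += min(chunk, num_columns - done)
    return done


def run_iteration(num_columns, update_work):
    """Run the slow task in a thread while the caller does the fast task."""
    stop = threading.Event()
    result = {}

    def panel():
        result["cols"] = panel_task(num_columns, stop)

    worker = threading.Thread(target=panel)
    worker.start()
    update_work()                          # the fast task
    stop.set()                             # alert the slower task
    worker.join()
    return result["cols"]


completed = run_iteration(1000, lambda: None)
```

The value of `completed` depends on thread timing, so it is only guaranteed to lie between 0 and `num_columns`. In a look-ahead factorization, that count would determine how much of the panel is ready for the next iteration.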

A highly parallel algorithm for computing the action of a matrix exponential on a vector based on a multilevel Monte Carlo method

Acebrón

Herrero

Monteiro

2020

Computers & Mathematics with Applications

A novel algorithm for computing the action of a matrix exponential over a vector is proposed. The algorithm is based on a multilevel Monte Carlo method, and the vector solution is computed probabilistically by generating random paths which evolve through the indices of the matrix according to a suitable probability law. The computational complexity is proved in this paper to be significantly better than that of the classical Monte Carlo method, which allows the computation of much more accurate solutions. Furthermore, the positive features of the algorithm in terms of parallelism were exploited in practice to develop a highly scalable implementation capable of solving some test problems very efficiently using high performance supercomputers equipped with a large number of cores. For the specific case of shared memory architectures the performance of the algorithm was compared with the results obtained using an available Krylov-based algorithm, outperforming the latter in all benchmarks analyzed so far.
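To make the idea of random paths over matrix indices concrete, here is a minimal sketch of a plain (single-level) random-walk estimator for one component of exp(A)v. It is not the multilevel method of the paper. It assumes the identity exp(A)v = e · E[A^k v] with k drawn from a Poisson(1) distribution, and transition probabilities proportional to |a_ij|; the function names are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam=1.0):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def expm_action_mc(A, v, i, samples, rng):
    """Estimate component i of exp(A) @ v by averaging weighted random walks."""
    n = len(A)
    total = 0.0
    for _ in range(samples):
        steps = sample_poisson(rng)        # P(steps = k) = e^{-1} / k!
        state, weight = i, 1.0
        for _ in range(steps):
            row = A[state]
            norm = sum(abs(x) for x in row)
            if norm == 0.0:                # zero row: the path contributes 0
                weight = 0.0
                break
            nxt = rng.choices(range(n), weights=[abs(x) for x in row])[0]
            weight *= row[nxt] * norm / abs(row[nxt])   # a_ij / p_ij
            state = nxt
        total += weight * v[state]
    return math.e * total / samples        # undo the Poisson normalisation


A = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 2.0]
estimate = expm_action_mc(A, v, 1, 20000, random.Random(0))
exact = 2.0 * math.exp(0.5)                # A is diagonal, so this is exact
```

Each walk is independent of the others, which is what makes estimators of this kind easy to distribute across many cores. The statistical error shrinks only with the square root of the number of samples, and reducing that cost is the motivation for variance-reduction schemes such as the multilevel approach.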

On new computational local orders of convergence

Grau-Sánchez

Noguera

Grau

et al. 2012

Applied Mathematics Letters

Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors

Alonso

Catalán

Herrero

et al. 2017

Parallelizing dense and banded linear algebra libraries using SMPSs
