Abstract. In this paper we present a detailed description of a high-performance distributed-memory implementation of balancing domain decomposition preconditioning techniques. This description provides a pool of implementation hints and considerations that can be very useful for scientists who want to tackle large-scale distributed-memory machines using these methods. In addition, the paper includes a comprehensive performance and scalability study of the resulting codes when they are applied to the solution of the Poisson problem on a large-scale multicore-based distributed-memory machine with up to 4096 cores. Well-known theoretical results guarantee the optimality (algorithmic scalability) of these preconditioning techniques in weak scaling scenarios, as they keep the condition number of the preconditioned operator bounded by a constant for a fixed load per core and an increasing number of cores. The experimental study presented in the paper complements this mathematical analysis and answers how far these methods can go in the number of cores and the scale of the problem while remaining within reasonable ranges of efficiency on current distributed-memory machines. Moreover, for those scenarios where poor scalability is expected, the study identifies, quantifies and justifies the main sources of inefficiency.

Key words. Domain decomposition, parallelization, scalability, coarse-grid correction, balancing domain decomposition, BNN, BDDC

AMS subject classifications. 65N55, 65F08, 65N30, 65Y05, 65Y20

1. Introduction. Scientific phenomena governed by partial differential equations (PDEs) range from solid mechanics to fluid mechanics and electrodynamics, including any of their possible couplings. The solution of these equations can be approximated with the aid of computers by means of a discretization (and possibly a linearization) and the subsequent numerical solution of the resulting sparse system of linear equations.
This work is concerned with the fast solution of the Poisson problem discretized by the finite element (FE) method. Although the Poisson problem is the simplest model problem for, e.g., fluid flow simulation, it is still very useful as a building block for the "physics-based" preconditioning of very complex scientific applications governed by coupled systems of PDEs [1].

The ever-increasing demand for realism in the simulation of the complex three-dimensional (3D) scientific and engineering problems faced nowadays leads to very large and sparse linear systems with several hundreds or even thousands of millions of equations/unknowns. Solving these systems in a reasonable time requires the vast amount of computational resources provided by current multicore-based distributed-memory machines. It is therefore essential to design parallel algorithms that are able to exploit the underlying architecture of these machines.