We propose a multilevel method to speed up highly optimized parallel codes whose runtime grows faster than their workload. The method requires the ability to solve large instances by decomposing them into smaller ones. Using a simple parallel computing model, we derive a mathematical model that predicts whether our method can improve performance and, if so, by how much. We test the method on three highly optimized BLAS (Basic Linear Algebra Subprograms) routines from Intel's Math Kernel Library (MKL): cblas_dgemm, cblas_dtrmm, and cblas_dsymm. On the Intel Knights Landing (KNL) platform, our method speeds up cblas_dgemm by 33%, cblas_dtrmm by 50%, and cblas_dsymm by 49% on double-precision matrices of size 16K × 16K, using the KNL's default memory-clustering configuration (cache-quadrant).
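To make the decomposition requirement concrete, the following is a minimal sketch (not the paper's actual scheme or tuning) of splitting one large GEMM into many smaller GEMM subproblems; `np.matmul` stands in for a tuned routine such as MKL's cblas_dgemm, and the block size 64 is an arbitrary illustrative choice.

```python
import numpy as np

def blocked_gemm(A, B, block):
    """Compute C = A @ B by decomposing the n x n problem into
    block x block subproblems, each solved by a smaller GEMM call.
    In a real multilevel implementation each subproduct would be
    dispatched to an optimized kernel (e.g. MKL's cblas_dgemm)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):
                # Accumulate the (i, j) output block from smaller products.
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 256))
    B = rng.standard_normal((256, 256))
    # The decomposed result matches the direct product.
    assert np.allclose(blocked_gemm(A, B, 64), A @ B)
```

The point of the sketch is only the solve-by-decomposition property the method relies on; the paper's contribution is a model predicting when running such smaller instances is faster than one large call.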