Mainstream high-performance computers commonly adopt a multisocket multicore CPU architecture and a NUMA-based memory architecture. Because traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared cache misses and remote memory accesses on these machines. To solve this problem, we propose a locality-aware work-stealing (LAWS) scheduler, which makes better use of both the shared cache and the memory system. In LAWS, a load-balanced task allocator evenly splits the dataset of a program across all the memory nodes and allocates each task to the socket whose local memory node stores its data, which reduces remote memory accesses. An adaptive DAG packer then uses an auto-tuning approach to optimally pack the execution DAG into cache-friendly subtrees. Once the cache-friendly subtrees are created, every socket executes them sequentially to optimize shared cache usage. Meanwhile, a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Through theoretical analysis, we show that the time and space bounds of LAWS are comparable to those of traditional work-stealing schedulers. Experimental results show that, compared with traditional work-stealing schedulers, LAWS improves the performance of memory-bound programs by up to 54.2% on AMD-based experimental platforms and by up to 48.6% on Intel-based experimental platforms.
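The allocation idea described above (split the dataset evenly across memory nodes, then run each task on the socket whose local node holds its data) can be illustrated with a minimal sketch. It is not the LAWS implementation: the function names `split_evenly` and `home_socket`, the contiguous index ranges, and the "largest overlap" rule for tasks that span two nodes are all assumptions made for illustration.

```python
# Illustrative sketch only; names and the tie-breaking rule are assumptions,
# not the allocator described in the paper.

def split_evenly(n_items, n_nodes):
    """Split n_items into n_nodes contiguous [start, end) ranges
    whose sizes differ by at most one element."""
    base, extra = divmod(n_items, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes each take one additional element.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def home_socket(task_range, node_ranges):
    """Return the index of the node (and hence socket) that stores the
    largest share of the data in task_range; ties go to the lowest index."""
    t_start, t_end = task_range
    best_node, best_overlap = 0, -1
    for node, (n_start, n_end) in enumerate(node_ranges):
        overlap = max(0, min(t_end, n_end) - max(t_start, n_start))
        if overlap > best_overlap:
            best_node, best_overlap = node, overlap
    return best_node


if __name__ == "__main__":
    node_ranges = split_evenly(10, 4)
    print(node_ranges)
    print(home_socket((4, 7), node_ranges))
```

Under these assumptions, a task whose data lies entirely inside one node's range is always placed on that node's socket, so its memory accesses stay local.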
ACM Reference Format:
Quan Chen and Minyi Guo. 2015. Locality-aware work stealing based on online profiling and auto-tuning for multisocket multicore architectures. ACM Trans. [...]

[...]", which was published in the International Conference on Supercomputing (ICS 2014). The 30% new material comes from two aspects:
- We have analyzed the theoretical time and space bounds of LAWS. Our analysis shows that these bounds are comparable to those of the original random work-stealing scheduler.
- This article significantly extends the experimental evaluation. We have evaluated the performance of LAWS on both Intel-based and AMD-based platforms.