Abstract—Hardware accelerators have become a de-facto standard for achieving high performance on current supercomputers, and there are indications that this trend will continue in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works alongside conventional DRAM. Theoretically, HBM can provide ∼4× higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level, and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on application performance using a set of applications representative of scientific and data-analytics workloads. Our results show that applications with regular memory access patterns benefit from MCDRAM, achieving up to 3× the performance obtained using only DRAM. On the contrary, applications with random memory access patterns are latency-bound and may suffer performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregate bandwidth when using HBM.
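The bandwidth-bound versus latency-bound distinction the abstract draws can be illustrated with a small, hardware-agnostic sketch. This is not a KNL benchmark; the function names and array sizes are our own illustrative choices, and only the access patterns themselves are meaningful here:

```python
import array
import random
import time

def sequential_sum(data):
    # Regular, streaming access: hardware prefetchers keep the memory
    # pipeline full, so performance tracks available memory bandwidth.
    total = 0
    for x in data:
        total += x
    return total

def random_sum(data, order):
    # Random access defeats prefetching: each load may pay the full
    # memory latency, so performance becomes latency-bound.
    total = 0
    for i in order:
        total += data[i]
    return total

n = 1 << 18
data = array.array('q', range(n))
order = list(range(n))
random.shuffle(order)

t0 = time.perf_counter()
seq_total = sequential_sum(data)
seq_time = time.perf_counter() - t0

t0 = time.perf_counter()
rnd_total = random_sum(data, order)
rnd_time = time.perf_counter() - t0
```

On real hardware, the same two loops written in C over arrays larger than the last-level cache expose the bandwidth/latency gap directly; interpreter overhead in Python largely masks the measured ratio, so this sketch only fixes the access patterns, not the magnitudes.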
Despite being one of the most important limiting factors on the road to exascale computing, power is not yet considered a "first-class citizen" among system resources. As a result, there is no clear OS interface that exposes accurate resource power consumption to user-level runtimes implementing power-aware software algorithms. In this work we propose a System Monitor Interface (SMI) between the OS and the user runtime that exposes accurate, per-core power consumption. To make up for the lack of reliable per-core power sensors, we implement a proxy power sensor, based on a regression analysis of core activity, that provides per-core information. SMI effectively hides the implementation details from the user, who has the perception of reading power information from a real sensor. This allows these proxy sensors to be replaced with real hardware sensors when the latter become available, without the need to modify user-level software. Using SMI and the proxy power sensors, we implement a power-profiling runtime library and analyze applications from the NPB benchmark suite and the Exascale Co-Design Centers. Our results show that accurate, per-core power information is necessary for the development of exascale system software and for comprehensively understanding the power characteristics of parallel scientific applications.
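A proxy power sensor of the kind described, calibrated by regressing measured power against a per-core activity metric, could be sketched as follows. The activity metric, the calibration data, and the class and function names are illustrative assumptions on our part, not the paper's implementation:

```python
def fit_linear(xs, ys):
    # Ordinary least-squares fit for y ≈ a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

class ProxyPowerSensor:
    """Hypothetical per-core proxy sensor: estimates power (W) from a
    core-activity metric using coefficients from offline regression."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def read(self, activity):
        # Behaves like reading a real sensor, but the value is modeled.
        return self.a * activity + self.b

# Synthetic calibration samples: (activity metric, measured per-core power).
activity = [0.2, 0.5, 0.8, 1.1, 1.4]
power = [3.1, 4.6, 6.1, 7.6, 9.1]

a, b = fit_linear(activity, power)
sensor = ProxyPowerSensor(a, b)
```

Hiding the model behind a `read()` call mirrors the design point the abstract makes: user-level software sees a sensor interface, so swapping in a real hardware sensor later requires no changes above the interface.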
Task-based programming models are considered among the most promising programming approaches for exascale supercomputers because of their ability to dynamically react to changing conditions and reassign work to processing elements. One question, however, remains unsolved: what should the task granularity of task-based applications be? Fine-grained tasks offer more opportunities to balance the system and generally result in higher system utilization, but they also incur large scheduling overhead. The impact of scheduling overhead on coarse-grained tasks is lower, but large systems may become imbalanced and underutilized. In this work we propose a methodology to analyze the interplay between application task granularity and scheduling overhead. Our methodology is based on three main points: 1) a novel algorithm that analyzes an application directed acyclic graph (DAG) and aggregates tasks; 2) a fast and precise emulator to analyze application behavior on systems with up to 1,024 cores; 3) a comprehensive sensitivity analysis of application performance and scheduling-overhead breakdown. Our results show that there is an optimal task granularity between 1.2×10^4 and 10×10^4 cycles for the representative schedulers. Moreover, our analysis indicates that a suitable scheduler for exascale task-based applications should employ a best-effort local scheduler and a sophisticated remote scheduler that moves tasks across worker threads.
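The abstract does not give the aggregation rules of the paper's algorithm, but one simple instance of DAG task aggregation merges linear chains of fine-grained tasks into coarser tasks until a per-task cycle budget is reached. A sketch under that assumption (data layout and names are ours):

```python
def aggregate(cost, children, budget):
    # Merge linear chains of tasks into coarser tasks whose total
    # cost stays within `budget` cycles. `cost` maps task -> cycles;
    # `children` maps task -> list of successors in the DAG.
    parent = {t: [] for t in cost}
    for t, cs in children.items():
        for c in cs:
            parent[c].append(t)

    def chain_head(t):
        # A chain starts where a task is not the sole child of a
        # single-successor parent (i.e., at roots, forks, and joins).
        ps = parent[t]
        return len(ps) != 1 or len(children.get(ps[0], [])) != 1

    groups = []
    for head in (t for t in cost if chain_head(t)):
        group, total, cur = [head], cost[head], head
        while True:
            cs = children.get(cur, [])
            # Stop the walk at forks, joins, and chain ends.
            if len(cs) != 1 or len(parent[cs[0]]) != 1:
                break
            cur = cs[0]
            if total + cost[cur] > budget:
                groups.append((group, total))  # close the filled group
                group, total = [], 0
            group.append(cur)
            total += cost[cur]
        groups.append((group, total))
    return groups
```

For example, a four-task chain A→B→C→D with 3,000 cycles per task and a 7,000-cycle budget aggregates into two coarse tasks of 6,000 cycles each, trading scheduling events for granularity exactly as the trade-off described above suggests.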