The convergence between computing- and data-centric workloads and platforms imposes new challenges on how to best use the resources of modern computing systems. In this paper, we investigate alternatives for the storage subsystem of a novel exascale-capable system, with special emphasis on how allocation strategies affect overall performance. We consider several aspects of data-aware allocation, such as the effect of spatial and temporal locality, the affinity of data to storage sources, and network-level traffic prioritization for different types of flows. In our experimental set-up, temporal locality can have a substantial effect on application runtime (up to a 10% reduction), whereas spatial locality can be even more significant (up to one order of magnitude faster with perfect locality). The use of structured access patterns to the data and the allocation of bandwidth at the network level can also have a significant impact (up to 20% and 17% reductions in runtime, respectively). These results suggest that scheduling policies exposing data-locality information can be essential for the appropriate utilization of future large-scale systems. Finally, we found that the distributed storage system we are implementing can outperform traditional SAN architectures, even with a much smaller back-end (in terms of I/O servers).
KEYWORDS
inter-processor communications, near-data computing, resource allocation, scheduling, storage traffic
INTRODUCTION
Traditionally, supercomputers have been used to execute large, compute-intensive parallel applications such as scientific codes. Nowadays, however, new types of data-oriented applications are becoming increasingly popular. In contrast with traditional high-performance computing (HPC) codes, they have to process massive amounts of scientific or business-oriented data and, hence, impose completely different requirements on the computing systems. Indeed, new hardware and software are being developed to meet these needs. One of these systems is our novel, custom-made architecture, ExaNeSt [1]. We are working on the design and construction of a prototype capable of reaching exascale computation using tens of millions of interconnected low-power-consumption ARM cores [2]. To support such data-intensive applications, we leverage a unified, low-latency interconnection network (hereafter, IN) and a fully distributed storage subsystem, BeeGFS, with the data spread across the nodes in local non-volatile memory (NVM) storage devices [3]. This greatly contrasts with traditional supercomputers and datacenters, which rely on Storage Area Networks (SANs) to access the data, with separate networks for I/O, system management, and inter-processor communications (IPC). A fully distributed file system allows for near-data computation, reducing the great overheads of moving data from centralized storage to the compute nodes [4]. A single, consolidated IN offers enormous power savings when compared with multi-network designs. While a unified IN does, indeed, allow us to cope with power and co...
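To make the near-data idea concrete, the following is a minimal, hypothetical sketch of data-affinity placement: given a map of which node stores each block of a task's input, schedule the task on the node that already holds the most blocks locally. All names here (pick_node, block_locations, free_nodes) are our own illustrations, not part of the ExaNeSt or BeeGFS interfaces.

```python
# Hypothetical sketch of data-affinity placement on a distributed file
# system: prefer the free node that locally stores most of a task's input.
from collections import Counter

def pick_node(task_blocks, block_locations, free_nodes):
    """Return the free node holding the most of the task's input blocks."""
    # Count, per free node, how many of the task's blocks it stores locally.
    tally = Counter(block_locations.get(b) for b in task_blocks
                    if block_locations.get(b) in free_nodes)
    if tally:
        return tally.most_common(1)[0][0]
    return next(iter(free_nodes))  # no local data anywhere: any free node

# Example: blocks b0 and b1 live on node n1; b2 on node n2.
locations = {"b0": "n1", "b1": "n1", "b2": "n2"}
print(pick_node(["b0", "b1", "b2"], locations, {"n1", "n2"}))  # prints n1
```

A real scheduler would also weigh load, replica placement, and network distance, but even this greedy rule captures why exposing data-locality information to the scheduler matters: the chosen node reads two of three blocks from local NVM instead of over the IN.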