In heterogeneous supercomputers such as TSU-BAME2.5, GPUs on some nodes in GPU batch queues are left idle even though there are jobs waiting in the queues; this is caused by GPU resource-assignment fragmentation problem. For example, in the case that each node has three GPUs like TSUBAME2.5's, if a node has already been assigned to a job requesting two GPUs per node, that node cannot be assigned to another job requesting more than one GPU per node until the ongoing job finishes; hence, one GPU is left idle on that node. We examine this problem on TSUBAME2.5's GPU batch-queue system and present a scheduling algorithm that assigns rCUDA (a remote CUDA execution technology) to some processes of some jobs. Because rCUDA allows jobs to utilize the idle GPUs, the proposed scheduling algorithm can alleviate the problem. Using a job pattern obtained from a scheduler log of a TSUBAME2.5's GPU queue, our simulation shows that the proposed algorithm can decrease jobs' lifetime (from the time when a job arrives until finishes) by about 5% on average. Moreover, it can reduce the average number of idle GPUs by about 15%. Also, even reducing the number of nodes serving jobs by around 4%, the proposed algorithm can maintain the average jobs' lifetime around the same as the scheduling algorithm currently used in the TSUBAME2.5's GPU queue.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.