Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically "spaceshared," each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when the job is submitted and when it begins execution) is highly variable. Users often find this variability a drag on productivity as it makes planning difficult and intellectual continuity hard to maintain.In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds for queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises 4 novel and interacting components: a predictor based on nonparametric inference; an automated change-point detector; machinelearned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector to identify systemic failures that affect job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them for all machines we have available for study.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.