Hierarchical Collective I/O Scheduling for High-Performance Computing

Liu, Jialin; Zhuang, Yu; Chen, Yong

doi:10.1016/j.bdr.2015.01.007

Cited by 9 publications

(3 citation statements)

References 31 publications

(58 reference statements)

“…A typical practice is that each developer may focus on only fixing one single failing test case where the passing test cases are utilized. The idea of parallel is pretty popular is the area of big data and I/O [15] [16]. Although some techniques, including test cases clustering, are possibly adoptable in parallel debugging, this paper employs a plain and simple parallel debugging where each processer, and not necessarily test engineer, debug one single failing test case where all the other passing test cases are also utilized.…”

Section: Parallel and Sequential Debugging (mentioning)

confidence: 99%

Debugging in Parallel or Sequential: An Empirical Study

Pang, Xue, Namin

2015

JSW

Faults need to be identified, localized, and removed from programs. Empirical studies show that coverage-based fault localization effectively targets bugs, even in the presence of multiple faults. Debugging is time-consuming, so it is beneficial to accelerate the process with appropriate techniques. The need to speed up debugging is even greater when the program under test contains multiple faults. A program with multiple faults can be debugged in parallel, where each sub-process targets localizing one of the bugs. The immediate research question is how significant the performance improvement is when debugging is performed in parallel rather than with a sequential fault localization strategy. This paper investigates and compares the performance of parallel and sequential debugging in localizing faults, where performance is measured by the fault localization cost each strategy requires. Based on an experimental study of several open-source Java programs, we observe that parallel debugging outperforms the sequential strategy in terms of total cost.

“…The occurrence of stragglers has significant effects on I/O performance of object storage systems. Since in HPC applications, clients normally need to synchronize after each I/O phase [3,17], the overall I/O performance will be determined by the longest one, which in turn is determined by the slowest object storage server. In general, the slow storage servers (i.e., stragglers) can be divided into two categories: long-term stragglers and short-term stragglers.…”

Section: Introduction (mentioning)

confidence: 99%

Client-side straggler-aware I/O scheduler for object-based parallel file systems

2019

Self Cite

Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems. Although object storage provides a flexible interface, scheduling highly concurrent I/O requests that access a large number of objects remains a challenging problem, especially when stragglers (storage servers that are significantly slower than others) exist in the system. An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. In this paper, we introduce log-assisted, straggler-aware I/O scheduling to mitigate the impact of storage server stragglers. The contribution of this study is threefold. First, we introduce a client-side, log-assisted, straggler-aware I/O scheduler architecture to tackle the storage straggler issue in HPC systems. Second, we present three scheduling algorithms, built on this architecture, that make efficient I/O scheduling decisions while avoiding stragglers. Third, we evaluate the proposed I/O scheduler using simulations, and the simulation results confirm the promise of the newly introduced straggler-aware I/O scheduler.
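
The citation statement for this entry notes that, because clients synchronize after each I/O phase, the phase takes as long as the slowest storage server. The following is a minimal sketch of that effect only, not the scheduler from the paper; the function name `phase_time` and the per-server timings are hypothetical values chosen for illustration.

```python
def phase_time(server_times):
    """Completion time of one synchronized I/O phase.

    Assumption: every client waits at a barrier until all servers finish,
    so the phase lasts as long as the slowest server.
    """
    if not server_times:
        raise ValueError("at least one server time is required")
    return max(server_times)


# Hypothetical per-server service times in seconds.
balanced = [1.0, 1.1, 0.9, 1.0]
with_straggler = [1.0, 1.1, 0.9, 4.0]  # the last server is a straggler

print(phase_time(balanced))        # phase bounded by the 1.1 s server
print(phase_time(with_straggler))  # phase bounded by the 4.0 s straggler
```

Under this assumption, one slow server out of four roughly quadruples the phase time even though the other three are unchanged, which is the motivation for steering requests away from stragglers.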

“…Estimating the time that applications spend on reading and writing data is a common task for their I/O performance tuning [19,20,10,31] and job scheduling [14] on HPC systems, and SQL query plan optimization [2,30] in database systems. One specific example is the Scientific Data Services Framework (SDS) [8,7,9] we are working on.…”

Section: Introduction (mentioning)

confidence: 99%

Heavy-tailed distribution of parallel I/O system response time

Dong, Byna

2015

Proceedings of the 10th Parallel Data Storage Workshop

Estimating the I/O time of applications is critical for computing system research and development, such as performance tuning and job scheduling. Parallel I/O systems on large-scale HPC systems typically use several I/O servers attached to a number of hard disk drives to read and write data concurrently. As a result, the response time of individual I/O servers affects the overall I/O performance, and modeling the response time distribution is key to estimating I/O time. Existing studies have generally assumed that the response time follows a Uniform or a Normal distribution. However, none of these studies examined supercomputing environments that are actively used by many users to verify that the Uniform or Normal distributions hold. In this study, we collected ≈2,500,000 measurements on two peta-scale class supercomputers that are actively used by ≈5000 users. These two systems, Hopper and Edison at the National Energy Research Scientific Computing Center (NERSC), typically support hundreds of concurrent jobs. Our performance measurements include the overheads introduced by the entire parallel I/O stack (I/O library, network, parallel file system software, cache, and hardware). Our study shows that the response time of a parallel I/O system is heavy-tailed, contrary to the widely accepted Normal or Uniform distributions. In exploring new models, we find that a mix of Power Law and Normal distributions is a good fit for the response time of parallel I/O systems that are actively used by hundreds of jobs concurrently.
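
As a rough illustration of what a mix of Power Law and Normal distributions looks like in practice, the sketch below draws synthetic response times from such a mixture and compares its upper tail with a purely Normal sample. It is not the model fitted in the paper: the mixing weight, the Pareto shape `alpha`, and the Normal parameters are all hypothetical, and a Pareto distribution is used here as one common power-law form.

```python
import random


def sample_response_times(n, tail_weight=0.1, alpha=1.5,
                          mu=1.0, sigma=0.2, seed=0):
    """Draw n synthetic response times from a Pareto/Normal mixture.

    All parameter values are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < tail_weight:
            # Power-law component (Pareto, values >= 1).
            samples.append(rng.paretovariate(alpha))
        else:
            # Normal component, clipped so times are never negative.
            samples.append(max(0.0, rng.gauss(mu, sigma)))
    return samples


def quantile(samples, q):
    """Empirical quantile by sorting (nearest-rank style)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


mixed = sample_response_times(20000)
normal_only = sample_response_times(20000, tail_weight=0.0)

print("99.9th percentile, mixture:", quantile(mixed, 0.999))
print("99.9th percentile, normal :", quantile(normal_only, 0.999))
```

With these assumed parameters, the extreme percentiles of the mixture sit far above those of the Normal-only sample, which is the practical consequence of a heavy tail for I/O time estimation.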
