Benefits of high speed interconnects to cluster file systems: a case study with Lustre

Yu, Weikuan; Noronha, R.; Liang, Shuang; Panda, Dhabaleswar K.

doi:10.1109/ipdps.2006.1639564

Cited by 7 publications

(5 citation statements)

References 12 publications


“…The shared storage may be a dual-hosted hard drive, a networked storage using RAID, or a distributed replicated block device (DRBD) [154]. Such solutions have been extensively used in HPC environments for critical system services, such as the job and resource manager (e.g., SLURM [192] and Sun Grid Engine (SGE) [175]) and the parallel file system MDS (e.g., Parallel Virtual File System (PVFS) [146] and Lustre [194]). • Active/hot-standby redundancy using a commit protocol for state replication has been implemented for some HPC job and resource managers as part of high availability cluster solutions, such as HA-OSCAR [119] with its commit protocol for OpenPBS [16].…”

Section: Rationale (mentioning)

confidence: 99%

“…An implementation of HA-OSCAR supported high availability clustering for two job and resource managers, OpenPBS [16] and SGE [175]). Parallel file system MDSs, such as Lustre [194], support high availability clustering as well. • Active/standby redundancy also plays a role in resilience for parallel applications in HPC environments.…”

Section: Rationale (mentioning)

confidence: 99%

“…Active/Standby redundancy is typically used for critical hardware or software systems in HPC environments. For example, power supplies, voltage regulators, the parallel file system MDS in Lustre [194] and the SLURM [192] job and resource manager are often implemented in an active/standby fashion. Dual-modular redundancy for error detection and failure compensation and triple-modular redundancy for error detection and correction and failure compensation [78], dual-redundant parallel file system MDS solutions [92] and dual-redundant mission-critical HPC systems (e.g., weather forecast).…”

Section: Solution (mentioning)

confidence: 99%

“…Examples: The Active/Standby structural pattern is typically used for critical hardware or software systems in HPC environments. For example, power supplies, voltage regulators, the parallel file system MDS in Lustre [194] and the SLURM [192] job and resource manager are often implemented in an active/standby fashion.…”

Section: Table 21, Active/Standby Pattern Parameters (mentioning)

confidence: 99%


Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)

Engelmann, Ashraf, Hukerikar, et al. 2022


“…There have been many efforts in parallel and distributed data management systems to provide large I/O bandwidth [4,16,23]. However, metadata management is still a challenging problem in widely distributed large-scale storage systems.…”

Section: Introduction (mentioning)

confidence: 99%

PetaShare: A Reliable, Efficient and Transparent Distributed Storage Management System

Kosar, Akturk, Balman, et al. 2011

Scientific Programming


Abstract. Modern collaborative science has placed an increasing burden on data management infrastructure to handle the increasingly large data archives generated. Besides functionality, reliability and availability are also key factors in delivering a data management system that can efficiently and effectively meet the challenges posed and compounded by the unbounded increase in the size of data generated by scientific applications. We have developed a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent and scalable access. In PetaShare, we have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability, and an advanced buffering system for improved data transfer performance. In this paper, we present the details of our design and implementation, show performance results, and describe our experience in developing a reliable and efficient distributed data management system for data-intensive science.


Performance Evaluation of A Infiniband-based Lustre Parallel File System

Yuan, Qiu, et al. 2011

Procedia Environmental Sciences

