Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In-Situ Workflows

Subedi, Pradeep; Davis, Philip E.; Duan, Shaohua; Klasky, Scott; Kolla, Hemanth; Parashar, Manish

doi:10.1109/sc.2018.00076

Cited by 31 publications

(17 citation statements)

References 31 publications


“…But the performance improvements that they reported were negligible (0.1-6%). Nevertheless, there is a growing trend in using ML techniques to solve storage and OS problems: predicting index structures in key-value stores [17,38], memory allocation [47], TCP congestion control [24], offline black-box optimization for storage parameters [8], database query optimization [37], local and distributed caching [60,66] and cloud resource management [16,19,20].…”

Section: Related Work (mentioning)

confidence: 99%

A Machine Learning Framework to Improve Storage System Performance

Akgün, Aydın, Shaikh et al. 2021

Proceedings of the 13th ACM Workshop on Hot Topics in Storage and File Systems


Storage systems and their OS components are designed to accommodate a wide variety of applications and dynamic workloads. Storage components inside the OS contain various heuristic algorithms to provide high performance and adaptability for different workloads. These heuristics may be tunable via parameters, and some system calls allow users to optimize their system performance. These parameters are often predetermined based on experiments with limited applications and hardware. Thus, storage systems often run with these predetermined and possibly suboptimal values. Tuning these parameters manually is impractical: one needs an adaptive, intelligent system to handle dynamic and complex workloads. Machine learning (ML) techniques are capable of recognizing patterns, abstracting them, and making predictions on new data. ML can be a key component to optimize and adapt storage systems. In this position paper, we propose KML, an ML framework for storage systems. We implemented a prototype and demonstrated its capabilities on the well-known problem of tuning optimal readahead values. Our results show that KML has a small memory footprint, introduces negligible overhead, and yet enhances throughput by as much as 2.3×.

CCS Concepts: Software and its engineering → Operating systems; File systems management. Computing methodologies → Machine learning.


“…For example, the scheduling of data transfer between tasks can too often create bottlenecks between computation and communication phases, and manual optimizations are often complex (Huang et al, 2019). We can train ML models to classify the workflow phases to optimize data movements, to orchestrate I/O (Meng et al, 2014; Wang et al, 2015), and to manage hierarchical storage (Dong et al, 2016) and data staging (Subedi et al, 2018). Also, as in-situ execution becomes more prevalent (Huang et al, 2019; Kwan-Liu, 2009; Subedi et al, 2018), ML can play an important role in automating the placement of tasks to automatically find an optimal trade-off.…”

Section: Current Challenges In Scientific Workflows (mentioning)

confidence: 99%

The role of machine learning in scientific workflows

Deelman, Mandal, Jiang et al. 2019

The International Journal of High Performance Computing Applica


Machine learning (ML) is being applied in a number of everyday contexts from image recognition, to natural language processing, to autonomous vehicles, to product recommendation. In the science realm, ML is being used for medical diagnosis, new materials development, smart agriculture, DNA classification, and many others. In this article, we describe the opportunities of using ML in the area of scientific workflow management. Scientific workflows are key to today’s computational science, enabling the definition and execution of complex applications in heterogeneous and often distributed environments. We describe the challenges of composing and executing scientific workflows and identify opportunities for applying ML techniques to meet these challenges by enhancing the current workflow management system capabilities. We foresee that as the ML field progresses, the automation provided by workflow management systems will greatly increase and result in significant improvements in scientific productivity.


“…TRIO [19] explores how to efficiently move large checkpointing datasets to the PFS by utilizing the burst buffers. Data Elevator [20] and Stacker [21] are similar to NORNS in that they focus on asynchronously moving data across I/O layers to optimize scientific workflows. The former specializes on applications using HDF5 while the latter optimizes data movements using machine learning techniques.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unfortunately, while computing and network resources can be shared and managed effectively by state-of-the-art job schedulers, storage resources are still mostly considered as black boxes by these infrastruc- [18]. While there has been increasing interest in HPC to use burst buffers to optimize the I/O path of data-driven workflows through autonomous, asynchronous data staging [19] [20] [21], these research efforts have not considered I/O as a first class entity in resource scheduling decisions. Thus, we argue that the integration of application I/O needs with scheduling and resource managers is critical to effectively use and manage a hierarchical storage stack that can include as many layers as NVRAM, node-local burst buffers, shared burst buffers, parallel file system, campaign storage, and archival storage.…”

Section: Introduction (mentioning)

confidence: 99%

NORNS: Extending Slurm to Support Data-Driven Workflows through Asynchronous Data Staging

Miranda, Jackson, Tocci et al. 2019

2019 IEEE International Conference on Cluster Computing (CLUSTER)


As HPC systems move into the Exascale era, parallel file systems are struggling to keep up with the I/O requirements from data-intensive problems. While the inclusion of burst buffers has helped to alleviate this by improving I/O performance, it has also increased the complexity of the I/O hierarchy by adding additional storage layers each with its own semantics. This forces users to explicitly manage data movement between the different storage layers, which, coupled with the lack of interfaces to communicate data dependencies between jobs in a data-driven workflow, prevents resource schedulers from optimizing these transfers to benefit the cluster's overall performance. This paper proposes several extensions to job schedulers, prototyped using the Slurm scheduling system, to enable users to appropriately express the data dependencies between the different phases in their processing workflows. It also introduces a new service for asynchronous data staging called NORNS that coordinates with the job scheduler to orchestrate data transfers to achieve better resource utilization. Our evaluation shows that a workflow-aware Slurm exploits node-local storage more effectively, reducing the filesystem I/O contention and improving job running times.

