Jakob Luettgau scite author profile

Kunkel

2017

Given the variety of storage technologies, deep storage hierarchies turn out to be the most feasible choice to meet performance and cost requirements when handling vast amounts of data. Long-term archives employed by scientific users rely mainly on tape storage, as it remains the most cost-efficient option. Archival systems are often loosely integrated into the HPC storage infrastructure. In anticipation of exascale systems and in situ analysis, burst buffers will also require integration with the archive. Exploring new strategies and developing open software for tape systems is difficult because affordable storage silos are scarce and rarely available outside large organizations, and because handling ultra-durable data demands extra caution. Lessening these problems by providing virtual storage silos should enable community-driven innovation and let site operators add features where they see fit while verifying strategies before deploying them on production systems. Different models for the individual components of tape systems are developed. The models are then implemented in a prototype using discrete event simulation. The work shows that the simulations can be used to approximate the behavior of tape systems deployed in the real world and to conduct experiments without requiring a physical tape system.
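To illustrate the discrete event simulation approach mentioned in this abstract, here is a minimal sketch of a tape library with a fixed number of drives. It is not the authors' prototype: the `Request` type, the `simulate` function, the FIFO policy, and the default mount time and transfer rate are all assumptions chosen for illustration.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float  # time the read request enters the system (seconds)
    size_mb: float  # amount of data to read from tape (MB)


def simulate(requests, n_drives=2, mount_s=60.0, rate_mb_s=300.0):
    """Return the completion time of each request, indexed like `requests`.

    Assumed model: every request needs one drive for a fixed mount time
    plus size / transfer rate; waiting requests are served FIFO.
    """
    # Event queue of (time, request index, kind); heapq pops the earliest.
    events = [(r.arrival, i, "arrive") for i, r in enumerate(requests)]
    heapq.heapify(events)
    free = n_drives
    waiting = deque()
    finished = [None] * len(requests)

    def start(now, i):
        # Schedule the finish event for request i starting service at `now`.
        service = mount_s + requests[i].size_mb / rate_mb_s
        heapq.heappush(events, (now + service, i, "finish"))

    while events:
        now, i, kind = heapq.heappop(events)
        if kind == "arrive":
            if free > 0:
                free -= 1
                start(now, i)
            else:
                waiting.append(i)
        else:  # "finish": record completion, then reuse or release the drive
            finished[i] = now
            if waiting:
                start(now, waiting.popleft())
            else:
                free += 1
    return finished


if __name__ == "__main__":
    demo = [Request(0.0, 300.0), Request(0.0, 300.0), Request(0.0, 600.0)]
    print(simulate(demo))
```

With two drives, the third request in the demo has to wait for a drive to become free, which is the kind of queueing effect such a simulation can expose without a physical tape system.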

NSDF-Catalog: Lightweight Indexing Service for Democratizing Data Delivery

Scorzelli

Pascucci

et al. 2022

NSDF-Fuse

Olaya

Zhou

et al. 2022

This work presents NSDF-FUSE, a testbed for evaluating settings and performance of FUSE-based file systems on top of S3-compatible object storage; the testbed is part of a suite of services from the National Science Data Fabric (NSDF) project (an NSF-funded project that is delivering cyberinfrastructures for data scientists). We demonstrate how NSDF-FUSE can be deployed to evaluate eight different mapping packages that mount S3-compatible object storage to a file system, as well as six data patterns representing different I/O operations on two cloud platforms. NSDF-FUSE is open-source and can be easily extended to run with other software mapping packages and different cloud platforms.
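As a rough idea of what evaluating I/O patterns against a mounted file system can look like, the sketch below times writes and reads of two synthetic patterns in a given directory. It is not NSDF-FUSE code: the function names and the two patterns (`many_small`, `few_large`) are hypothetical and are not the six patterns from the paper. Any directory can stand in for a FUSE mount point of an S3-compatible bucket.

```python
import os
import tempfile
import time


def time_pattern(directory, n_files, file_size):
    """Write, read back, and delete n_files of file_size bytes; return timings."""
    payload = os.urandom(file_size)
    paths = [os.path.join(directory, f"f{i}.bin") for i in range(n_files)]

    t0 = time.perf_counter()
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push data through to the backing store
    t1 = time.perf_counter()

    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    t2 = time.perf_counter()

    for path in paths:
        os.remove(path)
    return {"write_s": t1 - t0, "read_s": t2 - t1, "bytes": total}


def run_benchmark(mount_point):
    # Hypothetical patterns: (number of files, bytes per file).
    patterns = {
        "many_small": (64, 4 * 1024),
        "few_large": (2, 4 * 1024 * 1024),
    }
    return {
        name: time_pattern(mount_point, n, size)
        for name, (n, size) in patterns.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, result in run_benchmark(tmp).items():
            print(name, result)
```

Pointing `run_benchmark` at different mount points (one per mapping package) would give comparable numbers per pattern, under the assumption that caching effects are controlled separately.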

NSDF-Cloud

Olaya

Zhou

et al. 2022

Computational resources are increasingly provisioned to users through cloud-like interfaces. Both academic and commercial cloud offerings exist, but there is no single standardized interface for common actions such as configuring, launching, and terminating virtual resources. This imposes a huge technical burden on domain scientists who attempt to take advantage of these resources; even expert users spend considerable time porting their applications from one cloud platform to another. With this work, we make available to the community a unified API toolkit as well as five in-depth reports on challenges we encountered working with different academic and commercial cloud providers. Our toolkit implements automations for common tasks such as simultaneous launching and termination of large numbers of virtual machines (VMs) across the cloud. We demonstrate that our toolkit brings down the time users need to spend launching and terminating these resources to mere minutes, thus enabling ad hoc multi-cloud clusters.
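The idea of a unified interface over several cloud providers can be sketched as follows. This is not the NSDF-Cloud API: `CloudProvider`, `FakeProvider`, `launch_cluster`, and `terminate_cluster` are hypothetical names, and the fake provider only records VM identifiers in memory instead of calling a real cloud service.

```python
import threading
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class CloudProvider(ABC):
    """Hypothetical common interface each provider adapter would implement."""

    @abstractmethod
    def launch(self, name: str) -> str:
        """Start a VM and return its identifier."""

    @abstractmethod
    def terminate(self, vm_id: str) -> None:
        """Stop the VM with the given identifier."""


class FakeProvider(CloudProvider):
    """In-memory stand-in for a real provider; makes no network calls."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.running = set()
        self._lock = threading.Lock()

    def launch(self, name):
        vm_id = f"{self.prefix}-{name}"
        with self._lock:
            self.running.add(vm_id)
        return vm_id

    def terminate(self, vm_id):
        with self._lock:
            self.running.discard(vm_id)


def launch_cluster(providers, count_per_provider):
    """Launch VMs on all providers concurrently; return (provider, vm_id) pairs."""
    jobs = [
        (pname, f"node{i}")
        for pname in providers
        for i in range(count_per_provider)
    ]
    with ThreadPoolExecutor(max_workers=8) as pool:
        ids = list(pool.map(lambda job: providers[job[0]].launch(job[1]), jobs))
    return [(pname, vm_id) for (pname, _), vm_id in zip(jobs, ids)]


def terminate_cluster(providers, cluster):
    """Terminate every VM in `cluster` concurrently."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda item: providers[item[0]].terminate(item[1]), cluster))


if __name__ == "__main__":
    clouds = {"a": FakeProvider("cloud-a"), "b": FakeProvider("cloud-b")}
    cluster = launch_cluster(clouds, 3)
    print(cluster)
    terminate_cluster(clouds, cluster)
```

Because callers only see `launch` and `terminate`, supporting another platform in this sketch means adding one adapter class rather than changing the code that builds the cluster.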

Enabling Call Path Querying in Hatchet to Identify Performance Bottlenecks in Scientific Applications

Lumsden

Lama

et al. 2022

Development of Large-Scale Scientific Cyberinfrastructure and the Growing Opportunity to Democratize Access to Platforms and Data

Scorzelli

Pascucci

et al. 2023

Orchestration of materials science workflows for heterogeneous resources at large scale

Zhou

Scorzelli

The International Journal of High Performance Computing Applica

et al. 2023

In the era of big data, materials science workflows need to handle large-scale data distribution, storage, and computation. Any of these areas can become a performance bottleneck. We present a framework for analyzing internal material structures (e.g., cracks) to mitigate these bottlenecks. We demonstrate the effectiveness of our framework for a workflow performing synchrotron X-ray computed tomography reconstruction and segmentation of a silica-based structure. Our framework provides a cloud-based, cutting-edge solution to challenges such as growing intermediate and output data and heavy resource demands during image reconstruction and segmentation. Specifically, our framework efficiently manages data storage and scales up compute resources on the cloud. The software structure of our framework comprises three layers. A top layer uses Jupyter notebooks and serves as the user interface. A middle layer uses Ansible for resource deployment and for managing the execution environment. A low layer provides resource management and job scheduling on heterogeneous nodes (i.e., GPU and CPU). At the core of this layer, Kubernetes supports resource management, and Dask enables large-scale job scheduling for heterogeneous resources. The broader impact of our work is four-fold: through our framework, we hide the complexity of the cloud’s software stack from the user, who would otherwise need expertise in cloud technologies; we manage job scheduling efficiently and in a scalable manner; we enable resource elasticity and workflow orchestration at a large scale; and we facilitate moving the study of nonporous structures, which has wide applications in engineering and scientific fields, to the cloud. While we demonstrate the capability of our framework for a specific materials science application, it can be adapted for other applications and domains because of its modular, multi-layer architecture.
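The low layer's job of placing work on heterogeneous nodes can be illustrated with a small greedy scheduler. This is a standard-library stand-in, not the framework's Kubernetes and Dask setup: the `schedule` function, the task names, and the durations are hypothetical, and the policy (send each task to the earliest-free node of the required kind) is an assumption.

```python
import heapq


def schedule(tasks, nodes):
    """Assign tasks to nodes of the matching kind, earliest-free node first.

    tasks: list of (task_name, kind, duration), processed in order
    nodes: dict mapping node_name -> kind ("cpu" or "gpu")
    Returns a dict task_name -> (node_name, start_time, end_time).
    """
    # One min-heap per resource kind, holding (time_node_is_free, node_name).
    pools = {}
    for node, kind in nodes.items():
        pools.setdefault(kind, []).append((0.0, node))
    for heap in pools.values():
        heapq.heapify(heap)

    plan = {}
    for name, kind, duration in tasks:
        if kind not in pools:
            raise ValueError(f"no node of kind {kind!r} for task {name!r}")
        free_at, node = heapq.heappop(pools[kind])
        plan[name] = (node, free_at, free_at + duration)
        heapq.heappush(pools[kind], (free_at + duration, node))
    return plan


if __name__ == "__main__":
    nodes = {"gpu0": "gpu", "cpu0": "cpu", "cpu1": "cpu"}
    tasks = [
        ("reconstruct", "gpu", 10),  # hypothetical GPU-bound step
        ("segment_a", "cpu", 4),     # hypothetical CPU-bound steps
        ("segment_b", "cpu", 6),
        ("segment_c", "cpu", 3),
    ]
    for task, placement in schedule(tasks, nodes).items():
        print(task, placement)
```

A production scheduler would also account for data locality and task dependencies; the sketch only shows the routing of GPU and CPU work to separate pools.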

Ubique: A New Model for Untangling Inter-task Data Dependence in Complex HPC Workflows

Yeom

Ahn

Lumsden

et al. 2022

Exploiting task parallelism is getting increasingly difficult for diverse and complex scientific workflows running on High Performance Computing (HPC) systems. We argue that the difficulty arises from a void in the spectrum of existing data-transfer models for resolving inter-task data dependence within a workflow, and we propose a novel model, Ubique, to fill that gap. The Ubique model combines the best of the in-transit and in situ models so that loosely coupled producer and consumer tasks can run concurrently and resolve their data dependencies efficiently with little or no modification to their codes, striking a balance between transparent optimization, productivity, and performance. Our preliminary evaluation suggests that Ubique can significantly outperform the parallel file system (PFS)-based model while offering transparent data transfer and synchronization, features lacking in many traditional models.
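The general pattern of a producer and a consumer running concurrently, with the data dependence resolved through a channel instead of files on a parallel file system, can be sketched with threads and a bounded queue. This is not Ubique's implementation or interface: the function names, the sentinel, and the chunk contents are illustrative assumptions.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the producer's output


def producer(channel, n_chunks):
    """Emit chunks one by one; the consumer can start before we finish."""
    for i in range(n_chunks):
        channel.put([i] * 4)  # stand-in for one block of simulation output
    channel.put(_DONE)


def consumer(channel, results):
    """Process each chunk as soon as it becomes available."""
    while True:
        chunk = channel.get()
        if chunk is _DONE:
            break
        results.append(sum(chunk))


def run_pipeline(n_chunks=5):
    # The bounded queue applies back-pressure: the producer blocks when
    # the consumer falls more than two chunks behind.
    channel = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=producer, args=(channel, n_chunks)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

In a PFS-based model the consumer would typically wait for complete files to appear on disk; here the synchronization is implicit in the blocking `put` and `get` calls.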