2020 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC) 2020
DOI: 10.1109/urgenthpc51945.2020.00011
|View full text |Cite
|
Sign up to set email alerts
|

Towards Interactive, Reproducible Analytics at Scale on HPC Systems

Abstract: The growth in scientific data volumes has resulted in a need to scale up processing and analysis pipelines using High Performance Computing (HPC) systems. These workflows need interactive, reproducible analytics at scale. The Jupyter platform provides core capabilities for interactivity but was not designed for HPC systems. In this paper, we outline our efforts that bring together core technologies based on the Jupyter Platform to create interactive, reproducible analytics at scale on HPC systems. Our work is … Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1

Citation Types

0
2
0

Year Published

2022
2022
2023
2023

Publication Types

Select...
2
1

Relationship

0
3

Authors

Journals

citations
Cited by 3 publications
(2 citation statements)
references
References 21 publications
(20 reference statements)
0
2
0
Order By: Relevance
“…It utilizes Dask, which is a flexible library for parallel computing in Python and allows for dynamic task scheduling for optimizing computational workloads as well as handling large datasets. 43,44 Furthermore, it uses Apache Parquet file format, which is an efficient, structured, and column-oriented binary format capable of file size reduction and efficient reading of the desired stored values without the need to load the entire file into memory. 45 More concretely, to address the above-mentioned challenges, PARAMOUNT has been developed with the capability of assembling the data matrix Y for a list of variables of interest.…”
Section: Pod Formulation In Paramountmentioning
confidence: 99%
“…It utilizes Dask, which is a flexible library for parallel computing in Python and allows for dynamic task scheduling for optimizing computational workloads as well as handling large datasets. 43,44 Furthermore, it uses Apache Parquet file format, which is an efficient, structured, and column-oriented binary format capable of file size reduction and efficient reading of the desired stored values without the need to load the entire file into memory. 45 More concretely, to address the above-mentioned challenges, PARAMOUNT has been developed with the capability of assembling the data matrix Y for a list of variables of interest.…”
Section: Pod Formulation In Paramountmentioning
confidence: 99%
“…In addition, it provides a user-friendly platform to document the software while ensuring reproducibility by combining the data, code and software environment. As Jupyter notebook offers the core, JupyterHub expands the frameworks and brings flexibility to the user group [30]. JupyterHub provides access control and authentication, scalability with support for container and HPC technology, and it is portable from the cloud to a local machine.…”
Section: Notebook and Gitmentioning
confidence: 99%