Scientific data has traditionally been distributed by downloading files
from a data server to a local computer. This way of working suffers from
limitations as scientific datasets grow towards the petabyte scale. A
“cloud-native data repository,” as defined in this paper, offers
several advantages over traditional data repositories—performance,
reliability, cost-effectiveness, collaboration, reproducibility,
creativity, downstream impacts, and access & inclusion. These
objectives motivate a set of best practices for cloud-native data
repositories: analysis-ready data, cloud-optimized (ARCO) formats, and
loose coupling with data-proximate computing. The Pangeo Project has
developed a prototype implementation of these principles using
open-source scientific Python tools. By providing an ARCO data catalog
together with on-demand, scalable distributed computing, Pangeo enables
users to process big data at rates exceeding 10 GB/s. Several challenges
must be resolved in order to realize cloud computing’s full potential
for scientific research, such as organizing funding, training users, and
enforcing data privacy requirements.
As more analysis-ready datasets are provided on the cloud, we need to consider how researchers access them. To maximize performance and minimize cost, we move the analysis to the data. This notebook demonstrates a Pangeo deployment connected to multiple Dask Gateways, enabling analysis regardless of where the data is stored.

Public clouds are partitioned into regions: geographic locations, each with a cluster of data centers. A dataset like the National Water Model Short-Range Forecast is provided in a single region of some cloud provider (e.g. AWS's us-east-1). To analyze that dataset efficiently, we run the analysis in the same region as the dataset. This matters most for very large datasets, since making local "dark replicas" of them is slow and expensive.

In this notebook we demonstrate a few open-source tools for computing "close" to cloud data. We use Intake as a data catalog to discover the datasets we have available and load them as an xarray Dataset. With xarray, we write the transformations, filtering, and reductions that compose our analysis. To process the large amounts of data in parallel, we use Dask. Behind the scenes, we have configured this Pangeo deployment with multiple Dask Gateways, each of which provides a secure, multi-tenant server for managing Dask clusters. Each Gateway is provisioned with the permissions needed to access its data.

By placing the compute (the Dask workers) in the same region as the dataset, we achieve the highest performance: the worker machines are physically close to the machines storing the data and have the highest bandwidth to them. We also minimize cost by avoiding egress fees: charges billed to the data provider when data leaves a cloud region.
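The xarray/Dask part of this workflow can be sketched with a small, self-contained example, assuming `xarray`, `numpy`, and `dask` are installed. A synthetic dataset stands in here for a real catalog entry; in the deployment described above, the dataset would instead be loaded via `intake.open_catalog(...)` and computed on a cluster obtained from a Dask Gateway in the data's region. The variable name `streamflow` is illustrative, not the actual National Water Model schema.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a catalog entry. In the real deployment this
# Dataset would come from an Intake catalog backed by cloud storage.
ds = xr.Dataset(
    {"streamflow": (("time", "feature_id"), np.arange(12.0).reshape(4, 3))},
    coords={"time": np.arange(4), "feature_id": np.arange(3)},
)

# Chunking turns the variables into lazy dask arrays; subsequent
# operations build a task graph instead of computing immediately.
ds = ds.chunk({"time": 2})

# A reduction over time, expressed with xarray and executed in
# parallel by Dask when .compute() materializes the result. On a
# Dask Gateway cluster, these tasks would run on workers placed in
# the same cloud region as the data.
mean_flow = ds["streamflow"].mean(dim="time").compute()
print(mean_flow.values)  # → [4.5 5.5 6.5]
```

The same pattern scales from this toy example to the multi-terabyte case: only the chunk sizes and the cluster backing the computation change, not the analysis code.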