Physical samples are foundational entities for research across biological, Earth, and environmental sciences. Data generated from sample-based analyses are not only the basis of individual studies, but can also be integrated with other data to answer new and broader-scale questions. Ecosystem studies increasingly rely on multidisciplinary team science to study climate and environmental changes. While there are widely adopted conventions within certain domains to describe sample data, these have gaps when applied in a multidisciplinary context. In this study, we reviewed existing practices for identifying, characterizing, and linking related environmental samples. We then tested the practicalities of assigning persistent identifiers to samples, with standardized metadata, in a pilot field test involving eight United States Department of Energy projects. Participants collected a variety of sample types, with analyses conducted across multiple facilities. We address terminology gaps for multidisciplinary research and make recommendations for assigning identifiers and metadata that support sample tracking, integration, and reuse. Our goal is to provide a practical approach to sample management, geared towards ecosystem scientists who contribute and reuse sample data.
Machine learning can provide sustainable solutions to gap-fill groundwater (GW) data needed to adequately constrain watershed models. However, imputing missing extremes is more challenging than imputing other parts of a hydrograph. To impute missing subhourly data, including extremes, within GW time-series data collected at multiple wells in the East River watershed, located in southwestern Colorado, we consider a single-well imputation (SWI) and a multiple-well imputation (MWI) approach. SWI gap-fills missing GW entries in a well using the same well's time-series data; MWI gap-fills a specific well's missing GW entry using the time series of neighboring wells. SWI takes advantage of linear interpolation and random forest (RF) approaches, whereas MWI uses only the RF approach. We also use an information entropy framework to develop insights into how missing data patterns impact imputation. We discovered that if gaps were at random intervals, SWI could accurately impute up to 90% of missing data over an approximately two-year period. Contiguous gaps constituted more complex scenarios for imputation and required the use of MWI. Information entropy suggested that if gaps were contiguous, up to 50% of missing GW data could be estimated accurately over an approximately two-year period. The RF feature importance suggested that a time feature (months) and a space feature (neighboring wells) were the most important predictors in SWI and MWI. We also noted that neither SWI nor MWI could capture the missing extremes of a hydrograph. To counter this, we developed a new sequential approach and demonstrated the imputation of missing extremes in a GW time series with high accuracy.
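To make the single-well idea concrete, the following is a minimal sketch of gap-filling by linear interpolation, one of the two techniques the abstract names for SWI. It is not the authors' implementation: the function name `fill_gaps_linear`, the use of `None` to mark missing readings, and the assumption of evenly spaced time steps are all illustrative choices, and the random forest and MWI steps are not shown.

```python
from typing import List, Optional


def fill_gaps_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries by linear interpolation between the nearest
    observed readings on either side (hypothetical helper).

    Assumes evenly spaced time steps. Leading and trailing gaps have an
    observation on only one side, so they are left as None.
    """
    filled = list(values)
    # Indices of readings that were actually observed.
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive observations and fill between them.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Illustrative water levels (made-up numbers), with None marking gaps.
levels = [2.0, None, None, 5.0, None, 2.0]
print(fill_gaps_linear(levels))
```

A gap that spans a peak or trough is filled with values that lie between its two neighbors, so an interpolated series can never exceed the observed readings around the gap. That is consistent with the abstract's observation that missing extremes need separate treatment.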
Research can be more transparent and collaborative by using Findable, Accessible, Interoperable, and Reusable (FAIR) principles to publish Earth and environmental science data. Reporting formats (instructions, templates, and tools for consistently formatting data within a discipline) can help make data more accessible and reusable. However, the immense diversity of data types across Earth science disciplines makes developing and adopting such formats challenging. Here, we describe 11 community reporting formats for a diverse set of Earth science (meta)data including cross-domain metadata (dataset metadata, location metadata, sample metadata), file-formatting guidelines (file-level metadata, CSV files, terrestrial model data archiving), and domain-specific reporting formats for some biological, geochemical, and hydrological data (amplicon abundance tables, leaf-level gas exchange, soil respiration, water and sediment chemistry, sensor-based hydrologic measurements). More broadly, we provide guidelines that communities can use to create new (meta)data formats that integrate with their scientific workflows. Such reporting formats have the potential to accelerate scientific discovery and predictions by making it easier for data contributors to provide (meta)data that are more interoperable and reusable.