Abstract: A comprehensive record of research data provenance is essential for the successful curation, management, and reuse of data over time. However, creating such detailed metadata can be onerous, and there are few structured methods for doing so. In this case study of data curation in support of geobiology research conducted at Yellowstone National Park, we describe a method of "Research Process Modeling" for documenting noncomputational data provenance in a structured yet flexible way. The method combines systems …
“…Detailed description of all protocols and techniques used for field collection, biomolecule extractions, and meta-omic analyses are presented in the Supplementary Information and briefly summarized here. Detailed descriptions of the experimental design and metadata curation strategies adopted for all aspects of the field and laboratory analyses in the present study are presented in the works of Palmer et al (2017) and Thomer et al (2018).…”
The evolutionarily ancient Aquificales bacterium Sulfurihydrogenibium spp. dominates filamentous microbial mat communities in shallow, fast-flowing, and dysoxic hot-spring drainage systems around the world. In the present study, field observations of these fettuccini-like microbial mats at Mammoth Hot Springs in Yellowstone National Park are integrated with geology, geochemistry, hydrology, microscopy, and multi-omic molecular biology analyses. Strategic sampling of living filamentous mats along with the hot-spring CaCO3 (travertine) in which they are actively being entombed and fossilized has permitted the first direct linkage of Sulfurihydrogenibium spp. physiology and metabolism with the formation of distinct travertine streamer microbial biomarkers. Results indicate that, during chemoautotrophy and CO2 carbon fixation, the 87–98% Sulfurihydrogenibium-dominated mats utilize chaperons to facilitate enzyme stability and function. High-abundance transcripts and proteins for type IV pili and extracellular polymeric substances (EPSs) are consistent with their strong mucus-rich filaments tens of centimeters long that withstand hydrodynamic shear as they become encrusted by more than 5 mm of travertine per day. Their primary energy source is the oxidation of reduced sulfur (e.g., sulfide, sulfur, or thiosulfate) and the simultaneous uptake of extremely low concentrations of dissolved O2 facilitated by bd-type cytochromes. The formation of elevated travertine ridges permits the Sulfurihydrogenibium-dominated mats to create a shallow platform from which to access low levels of dissolved oxygen at the virtual exclusion of other microorganisms. These ridged travertine streamer microbial biomarkers are well preserved and create a robust fossil record of microbial physiological and metabolic activities in modern and ancient hot-spring ecosystems.
“…Further details of annotation, training, alignment, and curation are documented later in this paper in detail sufficient to understand the provenance of the dataset (Gebru et al, 2018; Thomer, Wickett, Baker, Fouke, & Palmer, 2018).…”
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold‐standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
“…In applying RPM, we made adaptations to the information collection and description tools. Thomer et al (2018) note that their intention was to focus on process flow, while ours is to highlight information flow, particularly for processes involving data interpretation. Thus, we chose to add many more artifacts to our activity diagram, to emphasize the role of outside information in the analysis and interpretation processes.…”
To design knowledge bases that effectively address desired reasoning goals, knowledge engineering requires a detailed description of information flow throughout the reasoning processes. Most existing workflow modeling technologies do not provide sufficient detail for projects where cognitive reasoning and field‐ or lab‐based data collection are important components. Research Process Modeling (RPM) was developed to support curation and data lifecycle needs, providing user‐targeted documentation on processes, agents, and artifacts, within research projects that include both computational and field‐ or lab‐based processes. We demonstrate the value of RPM to support the design of a knowledge engineering application within 3‐D geologic mapping, by documenting and describing information flow through a complex research project involving field‐, computation‐, and cognitive process‐generated data.