Genomic data are being produced and archived at a prodigious rate, and current studies could become historical baselines for future global genetic diversity analyses and monitoring programs. However, when we evaluated the potential utility of genomic data from wild and domesticated eukaryote species in the world’s largest genomic data repository, we found that most archived genomic datasets (86%) lacked the spatiotemporal metadata necessary for genetic biodiversity surveillance. Labor-intensive scouring of a subset of published papers yielded geospatial coordinates and collection years for only 33% (39% if place names were considered) of these genomic datasets. Streamlined data input processes, updated metadata deposition policies, and enhanced scientific community awareness are urgently needed to preserve these irreplaceable records of today’s genetic biodiversity and to plug the growing metadata gap.
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome-scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species and population resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies of evolutionary biology, molecular ecology and conservation genetics produce large amounts of genome-scale genetic diversity data for wild populations. While open data policies have ensured an abundance of freely available genomic data stored in the databases of the International Nucleotide Sequence Database Collaboration (INSDC), only about 13% of current accessions have the associated spatial and temporal metadata in INSDC necessary to be reused in monitoring programs, macrogenetic studies, or for acknowledging the sovereignty of nations or Indigenous Peoples. We undertook a “distributed datathon” to quantify the availability of these missing metadata in sources external to the INSDC and to test the hypothesis that these metadata decay with time. We also worked to remediate these missing metadata by extracting them, when present, from associated published papers, online repositories, and/or from direct communication with authors. Starting with 848 programmatically identified candidate datasets (INSDC BioProjects), we manually determined that 492 contained samples from wild populations. We successfully restored spatiotemporal metadata (locality name and/or geospatial coordinates and collection year) for 82% of these 492 datasets (N = 401 BioProjects comprising 42,104 individuals or BioSamples). We also quantified the availability of 33 additional categories of metadata in sources external to the INSDC. Information about associated publications and the type of habitat from which the samples were taken was the most easily found; information about sampling permits was the most challenging to locate. 
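The spatiotemporal criterion described above (a locality name and/or geospatial coordinates, plus a collection year) can be sketched as a simple completeness check. This is a minimal illustration only: the record keys `coordinates`, `locality`, and `collection_year` are hypothetical names chosen for this example, not INSDC field names, and the functions are not the authors' pipeline.

```python
from typing import Iterable, Mapping, Optional


def has_spatiotemporal_metadata(sample: Mapping[str, Optional[str]]) -> bool:
    """Return True when a record has a place and a collection year.

    A "place" is either coordinates or a locality name. The key names
    used here are hypothetical and chosen only for illustration.
    """
    def present(key: str) -> bool:
        value = sample.get(key)
        return value is not None and str(value).strip() != ""

    has_place = present("coordinates") or present("locality")
    has_year = present("collection_year")
    return has_place and has_year


def fraction_complete(samples: Iterable[Mapping[str, Optional[str]]]) -> float:
    """Share of records that meet the spatiotemporal criterion."""
    records = list(samples)
    if not records:
        return 0.0
    complete = sum(has_spatiotemporal_metadata(r) for r in records)
    return complete / len(records)
```

Applied to a list of sample records, `fraction_complete` gives the kind of proportion the abstract reports, under whatever definition of "present" the caller chooses to encode.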
Searching papers and online repositories was much more fruitful than contacting authors, who replied to only 45% of our email requests. Overall, 23% of our email queries to authors yielded useful metadata. Importantly, we found that the probability of retrieving spatiotemporal metadata declines significantly with the age of the dataset, with a 13.5% yearly decrease for metadata located in published papers or online repositories and up to a 22% yearly decrease for metadata that were only available from authors. This observable metadata decay, mirrored in studies of other types of biological data, should motivate swift updates to data sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost forever.
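To give a sense of how quickly a fixed yearly decrease compounds, the sketch below computes retrievability relative to a brand-new dataset. It assumes the reported percentage acts as a constant proportional decline per year; the abstract does not state the scale on which the decrease was estimated (for example, probability versus odds), so this is an illustration of compounding rather than a reproduction of the authors' model. The function name is hypothetical.

```python
def relative_retrievability(yearly_decrease: float, years: float) -> float:
    """Retrievability after `years`, relative to a dataset of age zero.

    Assumes a constant proportional decline each year (an assumption made
    for illustration; the source does not specify the underlying model).
    """
    if not 0.0 <= yearly_decrease < 1.0:
        raise ValueError("yearly_decrease must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - yearly_decrease) ** years


if __name__ == "__main__":
    # Rates taken from the text: 13.5% (papers and repositories), 22% (authors only).
    for rate in (0.135, 0.22):
        for age in (1, 5, 10):
            print(f"rate={rate:.3f} age={age:>2} -> {relative_retrievability(rate, age):.3f}")
```

Under this assumption, the relative figure falls below one half within about five years at the 13.5% rate, and sooner at the 22% rate.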