As DNA sequencing has become more popular, the public genetic repositories where sequences are archived have experienced explosive growth. These repositories now hold invaluable collections of sequences, e.g., for microbial ecology, but whether these data are reusable has not been evaluated. We assessed the availability and state of 16S rRNA gene amplicon sequences archived in public genetic repositories (SRA, EBI, and DDJ). We screened 26,927 publications in 17 microbiology journals, identifying 2015 16S rRNA gene sequencing studies. Of these, 7.2% had not made their data public at the time of analysis. Among a subset of 635 studies sequencing the same gene region, 40.3% contained data which was not available or not reusable, and an additional 25.5% contained faults in data formatting or data labeling, creating obstacles for data reuse. Our study reveals gaps in data availability, identifies major contributors to data loss, and offers suggestions for improving data archiving practices.
The sequencing revolution has resulted in the explosive growth of public genetic repositories. These repositories now hold invaluable collections of 16S rRNA gene amplicon sequences, but the extent to which the currently archived data is findable, accessible, and reusable has not been evaluated. We conducted a field-wide assessment of the availability and state of publicly archived 16S rRNA gene amplicon sequencing data. Using custom-built pattern-based text extraction algorithms, we searched 26,927 publications in 17 microbiology or microbial ecology journals, and identified 2,015 studies which performed 16S rRNA gene amplicon sequencing. We found, for example, that 7.2% of these had not been made public at the time of analysis, a trend which increased over time. Of the 635 studies targeting the V3-V4 region of the 16S rRNA gene, 40.3%
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.