William S Pearman scite author profile

Background: The first step in understanding ecological community diversity and dynamics is quantifying community membership. An increasingly common method for doing so is through metagenomics. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, the majority of these tools have been designed and benchmarked using highly accurate short read data (i.e. Illumina), with few studies benchmarking classification accuracy for long error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for nonmicrobial communities. Results: Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences (PacBio) with high accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that very generally, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and this is most pronounced for non-microbial communities. Conclusions: This work provides insight on the expected accuracy for metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.

show abstract

MARES, a replicable pipeline and curated reference database for marine eukaryote metabarcoding

Arranz

Pearman

Aguirre

et al. 2020

Sci Data

View full text Add to dashboard Cite

the use of DNa metabarcoding to characterise the biodiversity of environmental and community samples has exploded in recent years. However, taxonomic inferences from these studies are contingent on the quality and completeness of the sequence reference database used to characterise sample species-composition. In response, studies often develop custom reference databases to improve species assignment. the disadvantage of this approach is that it limits the potential for database re-use, and the transferability of inferences across studies. Here, we present the MaRine Eukaryote Species (MaRES) reference database for use in marine metabarcoding studies, created using a transparent and reproducible pipeline. MaRES includes all COI sequences available in GenBank and BOLD for marine taxa, unified into a single taxonomy. Our pipeline facilitates the curation of sequences, synonymization of taxonomic identifiers used by different repositories, and formatting these data for use in taxonomic assignment tools. Overall, MaRES provides a benchmark COI reference database for marine eukaryotes, and a standardised pipeline for (re)producing reference databases enabling integration and fair comparison of marine DNa metabarcoding results.

show abstract

Commonly used Hardy–Weinberg equilibrium filtering schemes impact population structure inferences using RADseq data

Pearman

Urban

Alexander

2022

Molecular Ecology Resources

View full text Add to dashboard Cite

Reduced representation sequencing (RRS) is a widely used method to assay the diversity of genetic loci across the genome of an organism. The dominant class of RRS approaches assay loci associated with restriction sites within the genome (restriction site associated DNA sequencing, or RADseq). RADseq is frequently applied to non‐model organisms since it enables population genetic studies without relying on well‐characterized reference genomes. However, RADseq requires the use of many bioinformatic filters to ensure the quality of genotyping calls. These filters can have direct impacts on population genetic inference, and therefore require careful consideration. One widely used filtering approach is the removal of loci that do not conform to expectations of Hardy–Weinberg equilibrium (HWE). Despite being widely used, we show that this filtering approach is rarely described in sufficient detail to enable replication. Furthermore, through analyses of in silico and empirical data sets we show that some of the most widely used HWE filtering approaches dramatically impact inference of population structure. In particular, the removal of loci exhibiting departures from HWE after pooling across samples significantly reduces the degree of inferred population structure within a data set (despite this approach being widely used). Based on these results, we provide recommendations for best practice regarding the implementation of HWE filtering for RADseq data sets.

show abstract

Testing the advantages and disadvantages of short- and long- read eukaryotic metagenomics using simulated reads

Pearman

Freed

Silander

2020

Preprint

View full text Add to dashboard Cite

Background: The first step in understanding ecological community diversity and dynamics is quantifying community membership. An increasingly common method for doing so is through metagenomics. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, the majority of these tools have been designed and benchmarked using highly accurate short read data (i.e. Illumina), with few studies benchmarking classification accuracy for long error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities. Results: Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences with high accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that very generally, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and this is most pronounced for non-microbial communities.Conclusions: This work provides insight on the expected accuracy for metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.

show abstract

Southern Hemisphere coasts are biologically connected by frequent, long-distance rafting events

et al. 2022

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.