The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to genus or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging because researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities using COI barcode fragments. The mock communities varied in species number and specimen abundance, while upstream molecular and bioinformatic variables were held constant. We also evaluated the impact of parameter optimization on the quality of the predictions. Our results indicate that BLAST top hit competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods that were benchmarked proved more sensitive to reference database heterogeneity and completeness than methods based on sequence similarity. The accuracy of assignments was affected by both species and specimen counts (query compositional heterogeneity), which ultimately influence the selection of appropriate software. We urge researchers to: (i) use realistic mock communities to allow optimization of parameters, regardless of the taxonomic assignment method employed; (ii) carefully choose and curate the reference databases, paying attention to their completeness; and (iii) use QIIME, BLAST or LCA methods, in conjunction with parameter tuning, to better assign taxonomy to diverse communities, especially when information on species diversity is lacking for the area under study.
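The difference between a top-hit assignment and a lowest common ancestor (LCA) assignment can be illustrated with a minimal Python sketch. This is not the benchmarked software: the function names, the three-rank lineage table, the identity values, and the 97% threshold are all assumptions chosen for illustration, and the threshold stands in for the kind of parameter the abstract suggests tuning against a mock community.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative types (assumptions, not from the study).
Hit = Tuple[str, float]            # (reference species, percent identity)
Lineage = Tuple[str, str, str]     # (family, genus, species)


def top_hit(hits: List[Hit], min_identity: float = 97.0) -> Optional[str]:
    """Return the single best-matching reference if it clears the threshold."""
    if not hits:
        return None
    species, identity = max(hits, key=lambda h: h[1])
    return species if identity >= min_identity else None


def lca(hits: List[Hit], lineages: Dict[str, Lineage],
        min_identity: float = 97.0) -> Tuple[str, ...]:
    """Return the deepest lineage prefix shared by all hits above the threshold."""
    kept = [lineages[s] for s, identity in hits if identity >= min_identity]
    if not kept:
        return ()
    shared = []
    for ranks in zip(*kept):       # walk family, genus, species in parallel
        if len(set(ranks)) != 1:   # hits disagree at this rank: stop here
            break
        shared.append(ranks[0])
    return tuple(shared)


# Made-up hits for one query sequence (identity values are illustrative).
lineages = {
    "Baetis rhodani": ("Baetidae", "Baetis", "Baetis rhodani"),
    "Baetis vernus": ("Baetidae", "Baetis", "Baetis vernus"),
    "Cloeon dipterum": ("Baetidae", "Cloeon", "Cloeon dipterum"),
}
hits = [("Baetis rhodani", 99.1), ("Baetis vernus", 98.4), ("Cloeon dipterum", 90.0)]

print(top_hit(hits))                          # species-level call from the best hit
print(lca(hits, lineages))                    # backs off to genus when close hits disagree
print(lca(hits, lineages, min_identity=99.0)) # a stricter threshold leaves one hit
```

In this toy case the top hit commits to a species, while the LCA rule stops at the genus because two near-identical references disagree; raising the threshold changes the LCA result, which is why tuning on a community of known composition matters.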
Geneticists have known for more than a decade that their focus on people with European ancestry exacerbates health disparities 1. A 2018 analysis of studies looking for genetic variants associated with disease found that under-representation persists: 78% of study participants were of European ancestry, compared to 10% of Asian ancestry and 2% of African ancestry. Other ancestries each represented less than 1% of the total 2. Several projects, such as H3Africa 3, are starting to increase participation of under-represented groups, both among participants and among researchers. Large biobanks assembled in Europe and North America, combining biological samples with health-related data, also set sampling targets to increase diversity 4,5,6. But even when data from minority groups are available, many researchers discard them 7.
Background: Assessing protein modularity is important to understand protein evolution. Still, the question of whether a sub-domain modular architecture exists remains open. We propose a graph-theory approach with significance and power testing to identify modules in protein structures. In the first step, clusters are determined by finding the partition that maximizes the modularity score. Second, each cluster is tested for significance. Significant clusters are referred to as modules. Evolutionary modules are identified by analyzing homologous structures. Dynamic modules are inferred from sets of snapshots of molecular simulations. We present here a methodology to identify sub-domain architecture in a way that is robust, biologically meaningful, and statistically supported.

Results: The robustness of this new method is tested using simulated data with known modularity. Modules are correctly identified even when there is a low correlation between landmarks within a module. We also analyzed the evolutionary modularity of a data set of α-amylase catalytic domain homologs, and the dynamic modularity of the Niemann-Pick C1 (NPC1) protein N-terminal domain. The α-amylase contains an (α/β)8 barrel (TIM barrel) with the polysaccharide cleavage site and a calcium-binding domain. In this data set we identified four robust evolutionary modules, one of which forms the minimal functional TIM barrel topology. The NPC1 protein is involved in intracellular lipid metabolism, coordinating sterol trafficking. The NPC1 N-terminus is the first luminal domain, which binds to cholesterol and its oxygenated derivatives. Our inferred dynamic modules in the NPC1 protein are also shown to match functional components of the protein related to the NPC1 disease.

Conclusions: A domain compartmentalization can be found and described in correlation space. To our knowledge, there is no other method attempting to identify sub-domain architecture from the correlation among residues. Most previous attempts focus on sequence motifs of protein-protein interactions, binding sites, or sequence conservation. We were able to describe functional/structural sub-domain architecture related to key residues for starch cleavage, calcium, and chloride binding sites in the α-amylase, and sterol opening-defining modules and disease-related residues in the NPC1. We also described the evolutionary sub-domain architecture of the α-amylase catalytic domain, identifying the already reported minimum functional TIM barrel.
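The abstract does not state which modularity score is optimized. As a hedged illustration only, the Python sketch below computes the standard Newman weighted modularity Q for a given partition, treating a symmetric matrix of non-negative weights (for example, absolute correlations between landmarks) as the graph. The function name and the toy matrix are assumptions, and neither the partition search nor the significance test described above is implemented.

```python
from typing import List, Sequence


def modularity(weights: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Newman weighted modularity Q of a partition.

    weights: symmetric matrix of non-negative edge weights (assumed input).
    labels:  cluster label for each node.
    Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / 2m] * delta(c_i, c_j)
    """
    n = len(weights)
    strength: List[float] = [sum(row) for row in weights]  # k_i
    total = sum(strength)                                  # 2m for a symmetric matrix
    if total == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total


# Toy graph (made up): nodes 0-1 and 2-3 are linked, with no links between pairs.
toy = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

print(modularity(toy, [0, 0, 1, 1]))  # partition matching the two pairs
print(modularity(toy, [0, 0, 0, 0]))  # everything in one cluster
```

A clustering step would search over label assignments for the one with the highest Q; here the partition that matches the two linked pairs scores higher than lumping all nodes together.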
The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to genus or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging. Researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities that vary in species number and specimen abundance, while holding upstream molecular and bioinformatic variables constant. It also evaluates the impact of parameter optimization on the quality of the predictions. Despite the general belief that BLAST top hit underperforms newer methods, our results indicate that it competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods that were benchmarked proved more sensitive to reference database heterogeneity and completeness than methods based on sequence similarity. The accuracy of assignments was affected by both species and specimen counts, which will influence the selection of appropriate software. We urge the use of realistic mock communities to allow optimization of parameters, regardless of the taxonomic assignment method used.