Abstract. Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clustering program to process the distance matrix. On a single GPU, CRiSPy achieves speedups of around two orders of magnitude compared to the sequential ESPRIT program for both the time-consuming pairwise genetic distance module and the whole processing pipeline, thus making CRiSPy particularly suitable for high-throughput microbial studies.
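To make the pairwise k-mer distance step concrete, the following is a minimal sequential sketch in Python, not the CUDA implementation described in the abstract. It assumes one commonly used definition of k-mer distance (one minus the number of shared k-mers, counted with multiplicity, divided by the number of k-mers in the shorter read). The abstract does not state the exact formula, and the function names and the default `k=6` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """Assumed definition: 1 - shared k-mers / k-mers in the shorter read."""
    shorter = min(len(a), len(b))
    if shorter < k:
        raise ValueError("both reads must be at least k bases long")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Shared k-mers are counted with multiplicity (minimum of the two counts).
    shared = sum(min(n, counts_b[kmer]) for kmer, n in counts_a.items())
    return 1.0 - shared / (shorter - k + 1)


def pairwise_kmer_distances(reads, k=6):
    """All-against-all comparison; returns {(i, j): distance} for i < j."""
    return {
        (i, j): kmer_distance(reads[i], reads[j], k)
        for i, j in combinations(range(len(reads)), 2)
    }
```

The all-against-all loop grows quadratically with the number of reads and every pair is independent of the others, which is why this kind of computation lends itself to GPU parallelization.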
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.
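The idea of a dynamic cutoff can be illustrated with a small sketch. The abstract does not specify the anomaly detection technique, so the code below substitutes a simple largest-jump heuristic as a stand-in: it looks for the biggest gap between consecutive merging heights and places the cutoff in the middle of that gap. The function names and the sample heights are hypothetical.

```python
def dynamic_cutoff(merge_heights):
    """Place the cutoff in the largest gap between consecutive merge heights.

    This is a stand-in heuristic (an assumption), not the paper's method.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    jumps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    # Index of the most abrupt change in merging height.
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return (heights[i] + heights[i + 1]) / 2


def num_clusters(merge_heights, cutoff):
    """Cutting a dendrogram leaves one cluster per merge above the cutoff, plus one."""
    return sum(h > cutoff for h in merge_heights) + 1


# Hypothetical merging heights for a dendrogram over six reads (five merges).
heights = [0.01, 0.02, 0.03, 0.20, 0.25]
cutoff = dynamic_cutoff(heights)
clusters = num_clusters(heights, cutoff)
```

With these sample heights the cutoff falls between 0.03 and 0.20, so the three low merges are kept and the two high merges are cut. A fixed 97 percent similarity threshold (a distance of 0.03) would sit right on the boundary of the same gap, which illustrates why a data-driven cutoff can be more robust.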
Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing. In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the quadratic memory requirement of the clustering procedure. Our performance evaluation shows that CRiSPy-Embed achieves higher efficiency in terms of both runtime and memory consumption in comparison to existing dendrogram-based approaches. Furthermore, to obtain the final OTU grouping, CRiSPy-Embed dynamically determines a natural cutoff of the dendrogram. With this strategy, CRiSPy-Embed achieves better and more robust clustering outcomes compared to other notable OTU clustering pipelines.
Background: Protein-protein docking is an in silico method to predict the formation of protein complexes. Due to limited computational resources, the protein-protein docking approach has been developed under the assumption of rigid docking, in which one of the two protein partners remains rigid during the protein associations and water contribution is ignored or implicitly presented. Despite obtaining a number of acceptable complex predictions, most initial rigid docking algorithms to date still find it difficult, or even fail, to discriminate the correct predictions from the incorrect or false positive ones. To improve the rigid docking results, re-ranking is one of the effective methods that help move the correct predictions into the top ranks, discriminating them from the incorrect ones. In this paper, we propose a new re-ranking technique using a new energy-based scoring function, namely IFACEwat, which combines Interface Atomic Contact Energy (IFACE) with a water effect. IFACEwat aims to further improve the discrimination of the near-native structures of the initial rigid docking algorithm ZDOCK3.0.2. Unlike other re-ranking techniques, IFACEwat explicitly implements interfacial water into the protein interfaces to account for the water-mediated contacts during the protein interactions.
Results: Our results showed that IFACEwat increased the number of near-native structures and improved their ranks as compared to the initial rigid docking ZDOCK3.0.2. In fact, IFACEwat achieved a success rate of 83.8% for Antigen/Antibody complexes, which is 10% better than ZDOCK3.0.2. As compared to another re-ranking technique, ZRANK, IFACEwat obtains success rates of 92.3% (8% better) and 90% (5% better) for medium and difficult cases, respectively. When compared with the latest published re-ranking method, F2Dock, IFACEwat performed equally well or even better for several Antigen/Antibody complexes.
Conclusions: With the inclusion of interfacial water, IFACEwat improves most results of the initial rigid docking, especially for Antigen/Antibody complexes. The improvement is achieved by explicitly taking into account the contribution of water during the protein interactions, which was ignored or not fully presented by the initial rigid docking and other re-ranking techniques. In addition, IFACEwat maintains sufficient computational efficiency of the initial docking algorithm, yet improves the ranks as well as the number of the near-native structures found. As our implementation has so far targeted the results of ZDOCK3.0.2, and particularly the Antigen/Antibody complexes, it is expected that further implementations will make the approach applicable to other initial rigid docking algorithms in the near future.
First and foremost, I would like to express my sincere thanks to my supervisors, Prof Bertil Schmidt and Assoc Prof Kwoh Chee Keong, for introducing me to the exciting fields of high performance computing and bioinformatics. I am grateful for their support, guidance and suggestions. Their expertise and academic experience have helped me become a capable researcher. In addition, I thank Prof Kyle Rupnow for his brief but valuable guidance during the period of my PhD qualification examination. Secondly, I would like to sincerely thank Ms. Irene Ng-Goh Siew Lai and Mr. Poliran Kenneth Caballes, the laboratory executives of the Parallel and Distributed Computing Center, for their efficient assistance and instant support in setting up and troubleshooting the hardware resources. I would also like to thank my best friend, Pham Chau Khoa, for many fruitful discussions about computers and programming. His passion for technology and excellent programming skills have inspired me to become a better programmer. Last but not least, my gratitude goes to my mother, Pham Thi Dieu Huong, and my former mentors, Dr. Timo Bretschneider and Dr. Ian McLoughlin, for the inspiration and encouragement they have provided me throughout my academic journey.
Golden Camellias have recently been used as a food, cosmetic, and traditional medicine in China and Vietnam. Forty-two species are naturally distributed in Vietnam, of which thirty-two are considered endemic to the country. The morphology of the leaves and flowers of these species is similar; therefore, their taxonomic identification usually requires experts, and species are often confused during authentication. Our study aims to describe the genetic diversity and the relationship of six species, Camellia phanii, Camellia tamdaoensis, Camellia tienii, Camellia flava, Camellia petelotii and Camellia euphlebia, by using three chloroplast DNA barcodes: matK, rbcL and trnH-psbA. We also clarified the significant differences in the anatomical characteristics of the midvein and blade of their leaves, which suggests that these criteria could be used in taxonomy. In addition, preliminary chemical profiles of the methanolic leaf extracts of the six Golden Camellias, including total phenolic content (TPC), total flavonoid content (TFC), total anthocyanin content (TAC) and chlorogenic acids content (TCGAs), also showed diversity among them. Interestingly, the differences in catechin profiles among the six species followed the same trend as the genetic distances on the phylogenetic tree, suggesting that catechins (i.e., discriminative catechins) can be biomarkers for the chemotaxonomy of these six Golden Camellias.