Abstract: Background
HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins.
Results
We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced va…
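For readers unfamiliar with the Viterbi algorithm named above, the following is a minimal sketch of the textbook dynamic-programming recurrence in log space. It decodes a single observation sequence against an ordinary HMM; it is not HH-suite's pairwise profile HMM alignment and it is not SIMD vectorized. The function name `viterbi` and the dictionary-based model layout are assumptions made for illustration only.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence.

    Illustrative layout (not taken from HH-suite):
      log_start[s]     log-probability of starting in state s
      log_trans[p][s]  log-probability of moving from state p to state s
      log_emit[s][x]   log-probability that state s emits symbol x
    """
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1

    for symbol in obs[1:]:
        prev = score
        score, pointers = {}, {}
        for s in states:
            # Choose the predecessor that maximizes the path score into s.
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][symbol]
            pointers[s] = best_prev
        back.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The inner loop over states performs the same add-and-maximize step for every cell of the dynamic-programming table, which is the kind of regular arithmetic that SIMD instructions can process several cells at a time.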
“…In the cases where we found discrepancies, the second and third best hits were used to verify the annotation. Finally, the remaining clusters without annotation were compared against the PDB HMM database (113) using hhblits (114). Clusters with less than 10 sequences were first inflated by using the uniclust30 (115) database.…”
The SAR11 clade is one of the most abundant bacterioplankton groups in surface waters of most of the oceans and lakes. However, only 15 SAR11 phages have been isolated thus far, and only one of them belongs to the Myoviridae family (pelagimyophages). Here, we have analyzed 26 sequences of myophages that putatively infect the SAR11 clade. They have been retrieved by mining ca. 45 Gbp aquatic assembled cellular metagenomes and viromes. Most of the myophages were obtained from the cellular fraction (0.2 μm), indicating a bias against this type of virus in viromes. We have found the first myophages that putatively infect Candidatus Fonsibacter (freshwater SAR11) and another group putatively infecting bathypelagic SAR11 phylogroup Ic. The genomes have similar sizes and maintain overall synteny in spite of low average nucleotide identity values, revealing high similarity to marine cyanomyophages. Pelagimyophages recruited metagenomic reads widely from several locations but always much more from cellular metagenomes than from viromes, opposite to what happens with pelagipodophages. Comparing the genomes resulted in the identification of a hypervariable island that is related to host recognition. Interestingly, some genes in these islands could be related to host cell wall synthesis and coinfection avoidance. A cluster of curli-related proteins was widespread among the genomes, although its function is unclear.
IMPORTANCE SAR11 clade members are among the most abundant bacteria on Earth. Their study is complicated by their great diversity and difficulties in being grown and manipulated in the laboratory. On the other hand, and due to their extraordinary abundance, metagenomic data sets provide enormous richness of information about these microbes. Given the major role played by phages in the lifestyle and evolution of prokaryotic cells, the contribution of several new bacteriophage genomes preying on this clade opens windows into the infection strategies and life cycle of its viruses. Such strategies could provide models of attack of large-genome phages preying on streamlined aquatic microbes.
“…We aligned the sequences of the non-annotated GCs with FAMSA 86 and obtained cluster consensus sequences with the hhconsensus program from HH-SUITE 37 . We used the cluster consensus sequences to perform a nested search against the UniRef90 database (release 2017_11) 90 and NCBI nr database (release 2017_12) 91 to retrieve non-Pfam annotations with…”
Section: Remote Homology Classification Of Gene Clusters (mentioning)
Abstract: Bridging the gap between the known and the unknown coding sequence space is one of the biggest challenges in molecular biology today. This challenge is especially extreme in microbiome analyses where between 40% and 60% of the coding sequences detected are of unknown function, and ignoring this fraction limits our understanding of microbial systems. Discarding the uncharacterized fraction is not an option anymore. Here, we present an in-depth exploration of the microbial unknown fraction through the lenses of a conceptual framework and a computational workflow we developed to unify the microbial known and unknown coding sequence space. Our approach partitions the coding sequence space in gene clusters and contextualizes them with genomic and environmental information. We analyzed 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes putting into perspective the extent of the unknown fraction, its diversity, and its relevance in a genomic and environmental context. With the identification of a target gene of unknown function for antibiotic resistance, we demonstrate how a contextualized unknown coding sequence space provides a robust framework for the generation of hypotheses that can be used to augment experimental data.
“…Following nine pairwise alignment methods were used for soluble proteins in this study: BLASTP (v2.8.1+), [ 19 ] DIAMOND (v0.9.24), [ 29 ] EMBOSS_Needle (v6.6.0), [ 30 ] EMBOSS_Water (v6.6.0), [ 31 ] LALIGN (FASTA package v36.3.8h), [ 32 ] NW‐align, [ 30 ] and SW‐align [ 31 ] for sequence–sequence alignments; PSI‐BLAST (v2.8.1+) [ 19 ] for sequence–profile alignments; and HH‐suite3 (v3.2.0) [ 33 ] for profile–profile alignments (i.e., hidden Markov model (HMM)‐HMM comparison). DIAMOND was implemented with its sensitive setting (‐‐more‐sensitive) throughout this study although this sensitive mode took longer computation time; DIAMOND with its default (fast) mode sometimes failed to find alignment score.…”
Section: Methods (mentioning)
confidence: 99%
“…[ 12 ] Finally, for the HH‐suite3, a HMM profile was generated by (four times) iteratively subjecting domain sequences in the Easy and Hard groups to the MSA using HHblits and Uniclust30. [ 33,34 ]…”
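The iterative profile-building step described in this citation statement could be launched roughly as follows. This is a hedged sketch, not the citing authors' actual command: it only assembles an `hhblits` argument list using the commonly documented options `-i`, `-d`, `-n`, `-oa3m` and `-o`. The helper name `build_hhblits_command`, the file names, and the database prefix are hypothetical placeholders.

```python
def build_hhblits_command(query_path, database_prefix, iterations=4,
                          a3m_out="query.a3m", hhr_out="query.hhr"):
    """Assemble an hhblits argument list (hypothetical helper).

    query_path      input sequence or alignment (placeholder path)
    database_prefix prefix of an HH-suite database such as Uniclust30 (placeholder)
    iterations      number of search iterations; 4 mirrors the "four times" in the quote
    """
    return [
        "hhblits",
        "-i", query_path,        # input query
        "-d", database_prefix,   # database to search
        "-n", str(iterations),   # number of iterations
        "-oa3m", a3m_out,        # write the resulting MSA in A3M format
        "-o", hhr_out,           # write the hit list
    ]


if __name__ == "__main__":
    # Print the command instead of running it; the paths here are made up.
    print(" ".join(build_hhblits_command("domain.fasta", "/path/to/uniclust30")))
```

Returning a list rather than a shell string lets the command be passed to `subprocess.run` without shell quoting concerns, assuming `hhblits` is installed and on the `PATH`.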
Modeling protein structures is critical for understanding protein functions in various biological and biotechnological studies. Among representative protein structure modeling approaches, template-based modeling (TBM) is by far the most reliable and most widely used approach to model protein structures. However, it remains a challenge to select appropriate software programs for pairwise alignment and model building, the two major steps of TBM. In this paper, pairwise alignment methods for TBM are first compared with respect to the quality of structure models built using these methods. This comparative study is conducted using comprehensive datasets, which cover 6185 domain sequences from Structural Classification of Proteins extended for soluble proteins, and 259 Protein Data Bank entries (whole protein sequences) from Orientations of Proteins in Membranes database for membrane proteins. Overall, a profile-based method, especially PSI-BLAST, consistently shows high performance across the datasets and model evaluation metrics used. Next, the choice between two model building programs, MODELLER and SWISS-MODEL, does not seem to significantly affect the quality of the protein structure models built, except for the Hard group (a group of relatively less homologous proteins) of membrane proteins. The results presented in this study will be useful for more accurate implementation of TBM.