Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genomes is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. The breadth is nonetheless continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90% of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive, as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families. 
Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.
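The pan- and core-genome calculation mentioned above can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to a set of gene-family identifiers (the clustering of genes into families is not shown), and it takes the pan-genome as the union of those sets and the core genome as their intersection. The function name `pan_and_core` and the toy strain and family labels are hypothetical and do not come from the review.

```python
from typing import Dict, Set, Tuple


def pan_and_core(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (pan_genome, core_genome) for a mapping of genome -> gene families.

    Assumption: gene families are given as plain string identifiers.
    The pan-genome is every family seen in at least one genome;
    the core genome is every family present in all genomes.
    """
    if not genomes:
        return set(), set()
    families = list(genomes.values())
    pan = set().union(*families)          # union over all genomes
    core = set.intersection(*families)    # shared by every genome
    return pan, core


# Hypothetical toy data: three strains, five gene families in total.
toy = {
    "strain_A": {"famA", "famB", "famC"},
    "strain_B": {"famA", "famB", "famD"},
    "strain_C": {"famA", "famB", "famE"},
}
pan, core = pan_and_core(toy)
print(len(pan), sorted(core))
```

With real data the core genome typically shrinks and the pan-genome grows as more genomes are added, which is the pattern behind the E. coli figures quoted above.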
For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison, a variation of a text or book comparison method, using word frequency profiles. In this approach it is critical to identify the optimal resolution range of l-mers for the given set of genomes compared. The optimum FFP method is applicable for comparing whole genomes or large genomic regions even when there are no common genes with high homology. We outline the method in 3 stages: (i) We first show how the optimal resolution range can be determined with English books which have been transformed into long character strings by removing all punctuation and spaces. (ii) Next, we test the robustness of the optimized FFP method at the nucleotide level, using a mutation model with a wide range of base substitutions and rearrangements. (iii) Finally, to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP-derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny, revealing that intron regions contain a similar level of phylogenic signal as do coding regions.
mammalian genome phylogeny | whole-genome comparison | whole-genome phylogeny | whole-intron phylogeny
The comparison of 2 closely related genomes at the base-by-base nucleotide sequence level is accomplished by sequence alignment. However, because species diverge extensively over time, insertions/deletions and genomic rearrangements make straightforward sequence alignment unreliable or impossible. This difficulty is typically overcome by 1 of 2 methods. 
The first involves extracting a common subset of genes (coding sequences) shared by all of the species compared, then building a multiple sequence alignment (MSA) for each gene, and finally concatenating each alignment into a super MSA (1). The MSA and an appropriate base-substitution model are used to calculate similarity scores. The second method is best described as gene profiling, where the occurrence of each gene in a dictionary of genes is counted, forming a gene presence/absence profile. The relative frequency difference between genomes from their gene profiles is used to derive a similarity score (2). Both methods rely on the correct definition and selection of common genes to be compared, and on significant homology among aligned gene sequences. If, however, the genomes do not share an alignable set of common genes, the alignment-free method is the only option at present. Also, these methods of comparison strictly focus on comparing the coding (coding for protein, and functional RNA) portions of genomes, which can amount to as little as 1% of the genomic sequence in humans (3). As for the noncoding sequence of the genome (the other 99%), much of its function is unknown, ...
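The alignment-free idea described above, comparing l-mer frequency profiles in place of aligned genes, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the excerpt does not say which distance is applied to the profiles, so the Jensen-Shannon divergence used here is an assumption, and the function names `ffp` and `js_divergence` are hypothetical. Selection of the optimal range of l is not shown.

```python
import math
from collections import Counter
from typing import Dict


def ffp(seq: str, l: int) -> Dict[str, float]:
    """Feature frequency profile: relative frequency of every overlapping l-mer."""
    if l <= 0 or len(seq) < l:
        raise ValueError("sequence must be at least l characters long")
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two profiles.

    Assumption: this distance is chosen for illustration only.
    The value is 0 for identical profiles and 1 for profiles
    that share no features.
    """
    def kl_to_midpoint(a: Dict[str, float], b: Dict[str, float]) -> float:
        total = 0.0
        for feature, pa in a.items():
            mid = 0.5 * (pa + b.get(feature, 0.0))
            total += pa * math.log2(pa / mid)
        return total

    return 0.5 * kl_to_midpoint(p, q) + 0.5 * kl_to_midpoint(q, p)


# Hypothetical toy sequences; real inputs would be whole genomes or large regions.
d = js_divergence(ffp("ACGTACGTAC", 3), ffp("ACGTTCGTAC", 3))
print(round(d, 4))
```

A matrix of such pairwise distances could then be passed to a standard tree-building algorithm, one tree per value of l within the chosen range.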
We present a whole-proteome phylogeny of prokaryotes constructed by comparing feature frequency profiles (FFPs) of whole proteomes. Features are l-mers of amino acids, and each organism is represented by a profile of frequencies of all features. The selection of feature length is critical in the FFP method, and we have developed a procedure for identifying the optimal feature lengths for inferring the phylogeny of prokaryotes (strictly speaking, a proteome phylogeny). Our FFP trees are constructed with whole proteomes of 884 prokaryotes, 16 unicellular eukaryotes, and 2 random sequences. To highlight the branching order of major groups, we present a simplified proteome FFP tree of monophyletic class or phylum with branch support. In our whole-proteome FFP trees (i) Archaea, Bacteria, Eukaryota, and a random sequence outgroup are clearly separated; (ii) Archaea and Bacteria form a sister group when rooted with random sequences; (iii) Planctomycetes, which possesses an intracellular membrane compartment, is placed at the basal position of the Bacteria domain; (iv) almost all groups are monophyletic in prokaryotes at most taxonomic levels, but many differences in the branching order of major groups are observed between our proteome FFP tree and trees built with other methods; and (v) previously "unclassified" genomes may be assigned to the most likely taxa. We describe notable similarities and differences between our FFP trees and those based on other methods in grouping and phylogeny of prokaryotes.
branching order | l-mers | prokaryotic phylogeny | random sequence outgroup | whole-genome phylogeny
Currently, a widely accepted phylogeny and classification of prokaryotes is based on the comparison of genes that encode small subunit ribosomal RNA (SSU rRNA) (1). This method also led to the proposal of three domains of organisms (Archaea, Bacteria, and Eukaryota). 
The branching order of the three domains with respect to the common origin was inferred by rooting the SSU rRNA tree using anciently duplicated genes (e.g., EF-Tu/EF-G, ATPase α and β subunits) (2). However, as more gene sequences became available, taxonomic groupings and phylogenies for prokaryotes derived from alternative genes often showed conflict with those based on SSU rRNA (3-6). This conflict is especially evident for the relationships between taxonomic groups, suggesting that the phylogeny of organisms is irresolvable through phylogenies derived from one or a few selected genes. At best, such phylogenies only reconstruct a possible evolutionary history of the selected gene or gene set, not the history of whole genomes or organisms. It is generally believed that the use of the whole genome/proteome may provide more robust information for inferring the phylogeny of organisms (3-6). This is supported by the observation that phylogenies based on progressively larger gene sets become more consistent and also less sensitive to artifacts from horizontal gene transfer (7). However, whole-genome/proteome comparison cannot be accomplished for a large po...
It has been 30 years since the initial emergence and subsequent rapid global spread of multidrug-resistant Salmonella enterica serovar Typhimurium DT104 (MDR DT104). Nonetheless, its origin and transmission route have never been revealed. We used whole-genome sequencing (WGS) and temporally structured sequence analysis within a Bayesian framework to reconstruct temporal and spatial phylogenetic trees and estimate the rates of mutation and divergence times of 315 S. Typhimurium DT104 isolates sampled from 1969 to 2012 from 21 countries on six continents. DT104 was estimated to have emerged initially as antimicrobial susceptible in ∼1948 (95% credible interval [CI], 1934 to 1962) and later became MDR DT104 in ∼1972 (95% CI, 1972 to 1988) through horizontal transfer of the 13-kb Salmonella genomic island 1 (SGI1) MDR region into susceptible strains already containing SGI1. This was followed by multiple transmission events, initially from central Europe and later between several European countries. An independent transmission to the United States and another to Japan occurred, and from there MDR DT104 was probably transmitted to Taiwan and Canada. An independent acquisition of resistance genes took place in Thailand in ∼1975 (95% CI, 1975 to 1990). In Denmark, WGS analysis provided evidence for transmission of the organism between herds of animals. Interestingly, the demographic history of Danish MDR DT104 provided evidence for the success of the program to eradicate Salmonella from pig herds in Denmark from 1996 to 2000. The results from this study refute several hypotheses on the evolution of DT104 and suggest that WGS may be useful in monitoring emerging clones and devising strategies for prevention of Salmonella infections.
We have constructed a map of the "protein structure space" by using the pairwise structural similarity scores calculated for all nonredundant protein structures determined experimentally. As expected, proteins with similar structures clustered together in the map and the overall distribution of structural classes of this map followed closely that of the map of the "protein fold space" we have reported previously. Consequently, proteins sharing similar molecular functions also were found to colocalize in the protein structure space map, pointing toward a previously undescribed scheme for structure-based functional inference for remote homologues based on the proximity in the map of the protein structure space. We found that this scheme consistently outperformed other predictions made by using either the raw scores or normalized Z-scores of pairwise DALI structure alignment.
global map of protein universe | multivariate analysis | protein function prediction | protein structure universe
The molecular functions of a protein can be inferred from either its sequence or structure information. Sequence-based function inference methods annotate molecular function of a protein from its sequence homologues. Most genome-wide functional annotations are carried out with this scheme, by using sequence alignment tools such as BLAST (1), or motif/profile-based search tools such as PROSITE (2, 3) and PFAM (4, 5). However, when two functionally similar proteins do not share detectable sequence homology, molecular function cannot be inferred based solely on sequence information. Low sequence homology results either from an early branching point in protein evolution (also known as remote homologues) or from convergent evolution. Many studies have focused on the detection of remote homologues (6-8). In general, methods using statistical models extracted from multiply aligned sequences perform better than pairwise sequence comparison methods (9). 
However, even these improved methods fail to recognize remote homologues with sequence identity <25-30%, which is estimated to be >25% of all sequenced proteins. Structure-based function inference, however, depends less on sequence information. During protein evolution, homology on sequence level is far less preserved compared with homology on structure level. Because proteins fold into specific structures to perform their molecular functions, structure-based functional inference is able to characterize remote homologous relationships of proteins that are impossible to detect by using sequences. By using different random sampling methods and similarity measuring functions, a large number of structural alignment algorithms have been developed to measure similarity of a pair of protein structures. Among these algorithms, DALI (10), SSAP (11), CE (12), and VAST (13) have been widely used, and their performances have been assessed [see Koehl (14) for a review]. The issue of predicting the function of remote homologues has become more prominent recently: the Structural Genomics initiative (15...
The vast sequence divergence among different virus groups has presented a great challenge to alignment-based sequence comparison among different virus families. Using an alignment-free comparison method, we construct the whole-proteome phylogeny for a population of viruses from 11 viral families comprising 142 large dsDNA eukaryote viruses. The method is based on the feature frequency profiles (FFP), where the length of the feature (l-mer) is selected to be optimal for phylogenomic inference. We observe that (i) the FFP phylogeny segregates the population into clades, the membership of which agrees remarkably well with the current classification by the International Committee on the Taxonomy of Viruses, with the one exception that the mimivirus joins the phycodnavirus family; (ii) the FFP tree detects potential evolutionary relationships among some viral families; (iii) the relative position of the 3 herpesvirus subfamilies in the FFP tree differs from gene alignment-based analysis; (iv) the FFP tree suggests the taxonomic positions of certain "unclassified" viruses; and (v) the FFP method identifies candidates for horizontal gene transfer between virus families.
alignment-free genome comparison | feature frequency profile | horizontal gene transfer | whole-genome phylogeny | virus phylogeny
The bacterial microbiota of plants is diverse, with thousands of operational taxonomic units (OTUs) associated with any individual plant. In this work, we used phenotypic analysis, comparative genomics, and metabolic models to investigate the differences between 19 sequenced Pseudomonas fluorescens strains. These isolates represent a single OTU and were collected from the rhizosphere and endosphere of Populus deltoides. While no traits were exclusive to either endosphere or rhizosphere P. fluorescens isolates, multiple pathways relevant for plant-bacterial interactions are enriched in endosphere isolate genomes. Further, growth phenotypes such as phosphate solubilization, protease activity, denitrification, and root growth promotion are biased toward endosphere isolates. Endosphere isolates have significantly more metabolic pathways for plant signaling compounds and an increased metabolic range that includes utilization of energy-rich nucleotides and sugars, consistent with endosphere colonization. Rhizosphere P. fluorescens have fewer pathways representative of plant-bacterial interactions but show metabolic bias toward chemical substrates often found in root exudates. This work reveals the diverse functions that may contribute to colonization of the endosphere by bacteria and are enriched among closely related isolates.
Ten complete mammalian genome sequences were compared by using the "feature frequency profile" (FFP) method of alignment-free comparison. This comparison technique reveals that the whole nongenic portion of mammalian genomes contains evolutionary information that is similar to their genic counterparts, the intron and exon regions. We partitioned the complete genomes of mammals (such as human, chimp, horse, and mouse) into their constituent nongenic, intronic, and exonic components. Phylogenic species trees were constructed for each individual component class of genome sequence data as well as the whole genomes by using standard tree-building algorithms with FFP distances. The phylogenies of the whole genomes and each of the component classes (exonic, intronic, and nongenic regions) have similar topologies, within the optimal feature length range, and all agree well with the evolutionary phylogeny based on a recent large dataset, multispecies, and multigene-based alignment. In the strictest sense, the FFP-based trees are genome phylogenies, not species phylogenies. However, the species phylogeny is highly related to the whole-genome phylogeny. Furthermore, our results reveal that the footprints of evolutionary history are spread throughout the entire length of the whole genome of an organism and are not limited to genes, introns, or short, highly conserved, nongenic sequences that can be adversely affected by factors (such as a choice of sequences, homoplasy, and different mutation rates) resulting in inconsistent species phylogenies.
alignment-free genome comparison | feature frequency profile (FFP) | mammalian phylogeny | noncoding DNA | nongenic regions of the genome
The current understanding of mammalian genomes (and of higher order eukaryotes in general) is primarily a "gene centric" view. As a result, genome comparisons among mammals have been gene based, and highly conserved genes are preferentially used to infer species divergence. 
However, the coding (coding for proteins, ribosomal RNAs, transfer RNAs, and other functional RNAs) portions of mammalian genomes can amount to as little as 1-3% of the whole genomic sequence, and it is debatable whether species phylogenies derived from a small, alignable subfraction of the whole genome are reliable. As for the noncoding sequence (the other 99%), much of its function is unknown, yet much of this portion is indeed transcribed. Recently, the ENCODE project showed that at least 93% of analyzed human genome nucleotides were transcribed into RNA when all various cell types were considered (1). Similarly, transcriptional analysis of human chromosomes demonstrated that transcripts originating from the nongenic regions comprise the largest fraction of the transcriptional output of the human genome (2). We have operationally defined a nongenic region to be those regions that have not been annotated to contain a gene in the GenBank records. Some known features in the nongenic sequence include transposable elements and sequences whose transcripts are long noncoding RNAs...