Background: The influenza A virus has two basic modes of evolution. Because of a high error rate in the process of replication by RNA polymerase, the viral genome drifts via accumulated mutations. The second mode of evolution is termed a shift, which results from the reassortment of the eight segments of this virus. When two different influenza viruses co‐infect the same host cell, new virions can be released that contain segments from both parental strains. This type of shift has been the source of at least two of the influenza pandemics in the 20th century (H2N2 in 1957 and H3N2 in 1968).

Objectives: The methods to measure these genetic shifts have not yet provided a quantitative answer to questions such as: what is the rate of genetic reassortment during a local epidemic? Are all possible reassortments equally likely or are there preferred patterns?

Methods: To answer these questions and provide a quantitative way to measure genetic shifts, a new method for detecting reassortments from nucleotide sequence data was created that does not rely upon phylogenetic analysis. Two different sequence databases were used: human H3N2 viruses isolated in New York State between 1995 and 2006, and human H3N2 viruses isolated in New Zealand between 2000 and 2005.

Results: Using this new method, we were able to reproduce all the reassortments found in earlier works, as well as detect, with very high confidence, many reassortments that were not detected by previous authors. We obtain a lower bound on the reassortment rate of 2–3 events per year, and find a clear preference for reassortments involving only one segment, most often hemagglutinin or neuraminidase. At a lower frequency, several segments appear to reassort in vivo in defined groups, as has been suggested previously in vitro.

Conclusions: Our results strongly suggest that the patterns of reassortment in the viral population are not random. Deciphering these patterns can be a useful tool in attempting to understand and predict possible influenza pandemics.
A search of the influenza virus genome database reveals anomalies associated with a nonnegligible number of submitted sequences. There are many pairs of viral segments that are very close to each other in nucleotide sequence but relatively far apart in reported time of isolation, resulting in an abnormally low evolutionary rate. Also, some sequences show clear evidence of apparent homologous recombination, a process normally assumed to be extremely rare or nonexistent in this virus. These findings may point to surprising new biology but are perhaps more readily explained by stock contamination or other errors in the sequencing laboratories.

In the last few years, an extraordinary amount of influenza virus genomic sequence has been submitted to publicly available databases (see, e.g., http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html, http://www.flu.lanl.gov, and http://influenza.genomics.org.cn). For instance, there are now over 3,300 full genome sets in the NCBI's rapidly growing Influenza Virus Resource. To our knowledge, no systematic attempt has been made to assess the quality of sequence data in this and similar collections. Our observations show that a fraction of the sequences in the database exhibit anomalous properties that point to either radically new biology or, more likely, problems with the data. As a first example, we consider the rate of nucleotide substitution in the influenza A virus. This rate has been previously estimated at 0.001 to 0.007 per nucleotide per year. (There have been many studies analyzing influenza virus evolutionary rates in different segments and different hosts; see, among others, references 6, 8, 9, 10, 11, 14, and 16.) Using the most conservative (lowest) estimate, we still find many pairs of virus segments that are far closer to each other in nucleotide space than would randomly occur in a Poisson process with this evolutionary rate, given the difference in time of isolation. Such sequences appear to be effectively "frozen in time." For instance, the PB2 segments of isolates A/duck/Taiwan/0526/1972(H6N1) and A/chicken/Taiwan/G23/87(H6N1) differ in only 1 nucleotide position out of 2,283 aligned nucleotides, whereas the expected number of differences, at 0.0015 substitution per nucleotide per year, would be ~48 for 15 years. For a null Poisson process, this gives an extremely low P value of 6.6 × 10⁻²⁰.
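The Poisson reasoning in the PB2 example can be sketched numerically. The snippet below is an illustration only, not the authors' code: it assumes the P value is the lower tail P(X ≤ k) of a Poisson distribution whose mean is the expected number of differences, and it takes the rounded figures quoted above (1 observed difference, about 48 expected). The function name `poisson_cdf` is hypothetical. With these rounded inputs the result should be on the order of 10⁻²⁰; exact agreement with the reported value should not be expected.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), summing the terms directly."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


# Figures quoted in the text (rounded; treated here as assumptions).
observed_differences = 1
expected_differences = 48.0

p_value = poisson_cdf(observed_differences, expected_differences)
print(f"P(X <= {observed_differences}) = {p_value:.1e}")
```

A direct sum is enough here because only the first two terms of the distribution are needed; for large `k` a library routine for the Poisson distribution would be the more robust choice.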
The degeneracy of codons allows a multitude of possible sequences to code for the same protein. Hidden within the particular choice of sequence for each organism are over 100 previously undiscovered biologically significant, short oligonucleotides (length, 2 to 7 nucleotides). We present an information-theoretic algorithm that finds these novel signals. Applying this algorithm to the 209 sequenced bacterial genomes in the NCBI database, we determine a set of oligonucleotides for each bacterium which uniquely characterizes the organism. Some of these signals have known biological functions, like restriction enzyme binding sites, but most are new. An accompanying scoring algorithm is introduced that accurately (92%) places sequences of 100 kb with their correct species among the choice of hundreds. This algorithm also does far better than previous methods at relating phage genomes to their bacterial hosts, suggesting that the lists of oligonucleotides are "genomic fingerprints" that encode information about the effects of the cellular environment on DNA sequence. Our approach provides a novel basis for phylogeny and is potentially ideally suited for classifying the short DNA fragments obtained by environmental shotgun sequencing. The methods developed here can be readily extended to other problems in bioinformatics.

Genome analysis has uncovered many sequence differences among organisms. Both mononucleotide and dinucleotide content, as well as codon usage, vary widely among genomes (6). The size of even small bacterial genomes is statistically sufficient to determine a substantially richer set of sequence-based features describing each organism. However, many of these features have remained elusive, in the coding regions in particular, due to complicated constraints. Each (protein-coding) gene encodes a particular protein, which constrains its possible nucleotide sequence. Because the genetic code is degenerate, this constraint still allows for an enormous number of possible DNA sequences for each gene. Also, the overall codon usage in each gene is known to have strong biological consequences, possibly determined by isoaccepting tRNA abundances (5). In order to isolate new features within the coding regions, these constraints must be factored out.

To solve this problem, we create a background genome that shares exactly the above-described constraints with the real genome but is otherwise random (4). The background genome encodes all the same proteins, and the codon usage is precisely matched for each gene. The hidden features for which we are searching are contained in the differences between the background genome and the real genome. The problem is reduced to extracting these differences.

We have incorporated information theory into an algorithm to systematically compute the over- and underrepresented strings of nucleotides (words) in the real genome compared to those of the background (see Materials and Methods for details). A major difficulty in finding these words is that they are not independent. For example, if ...
A new algorithm has been constructed for finding under- and overrepresented oligonucleotide motifs in the protein coding regions of genomes that have been normalized for G/C content, codon usage, and amino acid order. This Robins-Krasnitz algorithm has been employed to compare the oligonucleotide frequencies between many different prokaryotic genomes. Evidence is presented demonstrating that at least some of these sequence motifs are functionally important and selected for or against during the evolution of these prokaryotes. The applications of this method include the optimization of protein expression for synthetic genes in foreign organisms, identification of novel oligonucleotide signals used by the organism and the examination of evolutionary relationships not dependent upon different gene sequence trees.
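One way to picture a sequence that keeps codon usage and amino acid order fixed but is otherwise random is to shuffle synonymous codons within a gene. The sketch below is an assumption about how such a normalization could be realized, not the published algorithm: `background_gene`, `word_counts`, and `log2_ratio` are hypothetical names, the standard genetic code and uppercase A/C/G/T input are assumed, and the log ratio with a pseudocount is a deliberately naive stand-in for the information-theoretic scoring that the text does not detail.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(gene):
    """Split an in-frame coding sequence into codons, dropping a partial tail."""
    usable = len(gene) - len(gene) % 3
    return [gene[i:i + 3] for i in range(0, usable, 3)]


def translate(gene):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in split_codons(gene))


def background_gene(gene, rng):
    """Shuffle codons among positions that encode the same amino acid.

    The protein and the gene's exact codon counts are both preserved.
    """
    codons = split_codons(gene)
    positions_by_amino = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions_by_amino[CODON_TABLE[codon]].append(idx)
    shuffled = list(codons)
    for positions in positions_by_amino.values():
        pool = [codons[i] for i in positions]
        rng.shuffle(pool)
        for i, codon in zip(positions, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def log2_ratio(real, background, word, pseudo=1.0):
    """Naive over/under-representation score for one word (illustrative only)."""
    k = len(word)
    real_count = word_counts(real, k)[word] + pseudo
    background_count = word_counts(background, k)[word] + pseudo
    return math.log2(real_count / background_count)


gene = "ATGGCTGCCGCAGCGTTATTGCTTCTCAAAAAGTAA"  # toy gene, made up for the example
background = background_gene(gene, random.Random(0))
print(translate(gene) == translate(background))
print(log2_ratio(gene, background, "GC"))
```

A single shuffle is only one random draw from the constrained ensemble; a real comparison would presumably average over many such draws and would have to deal with the dependence between overlapping words noted in the text.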