The codon frequencies of each of the 257 468 complete protein-coding sequences (CDSs) have been compiled from the taxonomical divisions of the GenBank DNA sequence database. The sum of the codons used by 8792 organisms has also been calculated. The data files can be obtained from the anonymous ftp sites of DDBJ, Kazusa and EBI. A list of the codon usage of genes and the sum of the codons used by each organism can be obtained through the web site http://www.kazusa.or.jp/codon/. The present study also reports recent developments on the WWW site. The new web interface provides data in the CodonFrequency-compatible format as well as in the traditional table format. Use of the database is facilitated by keyword-based search and by the availability of codon usage tables for selected genes from each species. These new tools will allow users to further analyze variations in codon usage among different genomes.
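As a rough illustration of the kind of tabulation described above, the following sketch counts codons in complete CDSs and sums them into a per-organism table, expressed per thousand codons. The function names and the two toy sequences are assumptions made for this example; it does not reproduce the database's own pipeline or the CodonFrequency file layout.

```python
from collections import Counter


def codon_counts(cds):
    """Count the codons in one complete CDS (length must be a multiple of 3)."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return Counter(seq[i:i + 3] for i in range(0, len(seq), 3))


def codon_usage_per_thousand(cds_list):
    """Sum codon counts over several CDSs and scale to frequency per 1000 codons."""
    total = Counter()
    for cds in cds_list:
        total += codon_counts(cds)
    n = sum(total.values())
    if n == 0:
        return {}
    return {codon: 1000.0 * count / n for codon, count in sorted(total.items())}


# Two hypothetical toy genes standing in for the CDSs of one organism.
genes = ["ATGGCTGCTTAA", "ATGAAAGCTTGA"]
usage = codon_usage_per_thousand(genes)
for codon, per_thousand in usage.items():
    print(codon, round(per_thousand, 1))
```

Summing raw counts before normalizing, rather than averaging per-gene frequencies, weights long genes more heavily; which of the two is appropriate depends on the question being asked.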
The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up-to-date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a leading Japanese japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show the alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.
With the increasing amount of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. We used an unsupervised neural network algorithm, a self-organizing map (SOM), to analyze di-, tri-, and tetranucleotide frequencies in a wide variety of prokaryotic and eukaryotic genomes. The SOM, which can cluster complex data efficiently, was shown to be an excellent tool for analyzing global characteristics of genome sequences and for revealing key combinations of oligonucleotides representing individual genomes. From analysis of 1- and 10-kb genomic sequences derived from 65 bacteria (a total of 170 Mb) and from 6 eukaryotes (460 Mb), clear species-specific separations of major portions of the sequences were obtained with the di-, tri-, and tetranucleotide SOMs. The unsupervised algorithm could recognize, in most 10-kb sequences, the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature features of each genome. We were able to classify DNA sequences within one and between many species into subgroups that corresponded generally to biological categories. Because the classification power is very high, the SOM is an efficient and fundamental bioinformatic strategy for extracting a wide range of genomic information from a vast amount of sequences. [Supplemental material is available online at www.genome.org.] In addition to protein-coding information, genome sequences contain a wealth of information of interest in many fields of biology, from molecular evolution to genome engineering. G+C% has been used as a fundamental characteristic of individual genomes, but the G+C% is apparently too simple a parameter to differentiate a wide variety of genomes of known sequences.
Oligonucleotide frequency can be used to distinguish genomes, because oligonucleotide frequencies vary significantly among genomes; dinucleotide frequencies, for example, are shown to be genome signatures for both prokaryotes and eukaryotes (Nussinov 1984; Karlin et al. 1997; Karlin 1998; Gentles and Karlin 2001). Comprehensive analyses of oligonucleotide frequencies in a wide variety of genomes are thought to provide fundamental knowledge of individual genomes, namely, key combinations of oligonucleotides responsible for the biological properties of the different genomes and genome portions. We applied Kohonen's self-organizing map (SOM) to create graphical representations of oligonucleotide frequencies from which we could extract a wide range of genomic information. The unsupervised neural network algorithm is an effective tool for clustering and visualizing high-dimensional data; it converts complex nonlinear relations among high-dimensional data into simple geometric relations that can be viewed in two dimensions (Kohonen 1982, 1990; Kohonen et al. 1996). We and others have used SOMs to characterize codon usage patterns of a wide variety of bacteria (Kanaya et al. 1998; Wang et al. 2001). We introduced a new feature ...
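The general idea of mapping oligonucleotide frequencies onto a SOM can be sketched as follows. This is a minimal, generic online Kohonen update written for illustration, not the authors' procedure: the grid size, learning-rate and radius schedules, random initialization, and the two synthetic "genomes" are all assumptions of this example.

```python
import itertools
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer, in a fixed order (4**k values)."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # words with N or other ambiguity codes are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def windows(seq, size):
    """Cut a sequence into non-overlapping windows of a fixed size."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def best_unit(grid, v):
    """Grid coordinate of the unit whose weight vector is closest to v."""
    return min(grid, key=lambda u: sum((a - b) ** 2 for a, b in zip(grid[u], v)))


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM with a shrinking Gaussian neighborhood."""
    rng = random.Random(seed)
    # Assumption: weights start as copies of randomly chosen input vectors.
    grid = {(r, c): list(rng.choice(vectors))
            for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        frac = 1.0 - t / epochs
        alpha = 0.5 * frac                        # learning rate decays linearly
        radius = max(rows, cols) / 2 * frac + 0.5  # neighborhood radius shrinks
        for v in rng.sample(vectors, len(vectors)):
            br, bc = best_unit(grid, v)
            for (r, c), w in grid.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))
                for i in range(len(w)):
                    w[i] += alpha * h * (v[i] - w[i])
    return grid


# Two synthetic sequences with different base composition stand in for genomes.
rng = random.Random(1)
at_rich = "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=5000))
gc_rich = "".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=5000))
data = [(name, kmer_frequencies(w, 2))
        for name, seq in (("AT-rich", at_rich), ("GC-rich", gc_rich))
        for w in windows(seq, 500)]
som = train_som([v for _, v in data], rows=2, cols=2)
for name, v in data:
    print(name, best_unit(som, v))  # grid cell assigned to each window
```

Each window is reduced to a 16-dimensional dinucleotide frequency vector; raising `k` to 3 or 4 gives the 64- and 256-dimensional tri- and tetranucleotide vectors. Real genomic data would call for a much larger grid and more careful initialization than this toy setup.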
Recent awareness that most microorganisms in the environment are resistant to cultivation has prompted scientists to directly clone useful genes from environmental metagenomes. Two screening methods are currently available for the metagenome approach, namely, nucleotide sequence-based screening and enzyme activity-based screening. Here we have introduced and optimized a third option for the isolation of novel catabolic operons, that is, substrate-induced gene expression screening (SIGEX). This method is based on the knowledge that catabolic-gene expression is generally induced by relevant substrates and, in many cases, controlled by regulatory elements situated proximate to catabolic genes. For SIGEX to be high throughput, we constructed an operon-trap gfp-expression vector available for shotgun cloning that allows for the selection of positive clones in liquid cultures by fluorescence-activated cell sorting. The utility of SIGEX was demonstrated by the cloning of aromatic hydrocarbon-induced genes from a groundwater metagenome library and subsequent genome-informatics analysis.