Theoretical considerations predict that amplification of expressed gene transcripts by reverse transcription-PCR using arbitrarily chosen primers will result in the preferential amplification of the central portion of the transcript. Systematic, high-throughput sequencing of such products would result in an expressed sequence tag (EST) database consisting of central, generally coding regions of expressed genes. Such a database would add significant value to existing public EST databases, which consist mostly of sequences derived from the extremities of cDNAs, and facilitate the construction of contigs of transcript sequences. We tested our predictions, creating a database of 10,000 sequences from human breast tumors. The data confirmed the central distribution of the sequences, the significant normalization of the sequence population, the frequent extension of contigs composed of existing human ESTs, and the identification of a series of potentially important homologues of known genes. This approach should make a significant contribution to the early identification of important human genes, the deciphering of the draft human genome sequence currently being compiled, and the shotgun sequencing of the human transcriptome.
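The central bias predicted above can be illustrated with a deliberately simplified model. The sketch below is not the authors' theoretical treatment. It assumes that the two arbitrary primers anneal at positions drawn uniformly along the transcript, and it ignores amplicon size limits and PCR efficiency. The function name `coverage_profile` and the transcript length of 1,000 nt are hypothetical choices made for illustration. Under these assumptions, the fraction of possible amplicons covering a given position peaks at the middle of the transcript and falls off toward both ends.

```python
def coverage_profile(length):
    """Fraction of all primer-site pairs (i <= j) whose amplicon covers each position.

    Assumption (not from the source): both priming sites are equally likely
    anywhere along a transcript of `length` nucleotides.
    """
    total_pairs = length * (length + 1) // 2
    profile = []
    for k in range(length):
        # The upstream site can be any of positions 0..k and the
        # downstream site any of positions k..length-1.
        covering = (k + 1) * (length - k)
        profile.append(covering / total_pairs)
    return profile


profile = coverage_profile(1000)  # hypothetical 1,000-nt transcript
peak = max(range(len(profile)), key=profile.__getitem__)

print("position with highest coverage:", peak)
print("coverage at 5' end:", profile[0])
print("coverage at centre:", profile[len(profile) // 2])
```

In this toy model, coverage is proportional to (k + 1)(L - k) for position k on a transcript of length L, a parabola centred on the midpoint. Real priming is sequence-dependent, so the sketch only shows why a central bias is plausible, not how strong it is in practice.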
Transcribed sequences in the human genome can be identified with confidence only by alignment with sequences derived from cDNAs synthesized from naturally occurring mRNAs. We constructed a set of 250,000 cDNAs that represent partial expressed gene sequences and that are biased toward the central coding regions of the resulting transcripts. They are termed ORF expressed sequence tags (ORESTES). The 250,000 ORESTES were assembled into 81,429 contigs. Of these, 1,181 (1.45%) were found to match sequences in chromosome 22 with at least one ORESTES contig for 162 (65.6%) of the 247 known genes, for 67 (44.6%) of the 150 related genes, and for 45 of the 148 (30.4%) EST-predicted genes on this chromosome. Using a set of stringent criteria to validate our sequences, we identified a further 219 previously unannotated transcribed sequences on chromosome 22. Of these, 171 were in fact also defined by EST or full length cDNA sequences available in GenBank but not utilized in the initial annotation of the first human chromosome sequence. Thus despite representing less than 15% of all expressed human sequences in the public databases at the time of the present analysis, ORESTES sequences defined 48 transcribed sequences on chromosome 22 not defined by other sequences. All of the transcribed sequences defined by ORESTES coincided with DNA regions predicted as encoding exons by GENSCAN (http://genes.mit.edu/GENSCAN.html).
Complete bacterial genome sequences allow a relatively precise and complete analysis of constituent genes and coding regions by means of direct computational analysis (1). In complex eukaryotic genomes, however, it is proving considerably more difficult to identify genes because of their fragmentation into multiple small exons divided by often considerably larger introns. In this context, the determination of the complete sequence of the human chromosome 22 allowed a detailed appraisal of the efficacy of gene prediction methodologies (2).
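The chromosome 22 figures quoted in the abstract above can be rechecked with a few lines of arithmetic. The sketch below uses only the counts stated in the text. The variable names and the one-decimal rounding are illustrative choices, not part of the original analysis.

```python
# Counts as stated in the abstract: (genes with an ORESTES match, genes in category).
gene_categories = {
    "known genes": (162, 247),
    "related genes": (67, 150),
    "EST-predicted genes": (45, 148),
}


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of each chromosome 22 gene category matched by at least one ORESTES contig.
coverage = {
    name: round(percent(hit, total), 1)
    for name, (hit, total) in gene_categories.items()
}

# Share of all ORESTES contigs that match chromosome 22.
contig_share = round(percent(1181, 81429), 2)

# Previously unannotated transcribed sequences not supported by other
# GenBank EST or full-length cDNA sequences.
orestes_only = 219 - 171

for name, value in coverage.items():
    print(f"{name}: {value}%")
print(f"contigs matching chromosome 22: {contig_share}%")
print(f"sequences defined only by ORESTES: {orestes_only}")
```

The recomputed values agree with the text except that 67/150 rounds to 44.7% where the abstract prints 44.6%, which is consistent with truncation rather than rounding.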
It was noted that when known genes (where complete cDNA sequences have been determined) were compared with an ab initio prediction of the same region by using the best computational methods available, only 94% of annotated genes were detected. More importantly, in only 20% of cases were all exons exactly predicted, and 16% of all known exons were entirely missed. On the other hand, almost 40% of GENSCAN-predicted genes did not form part of any gene confirmed by other means and include an unknown proportion of false positives (2). In the absence of adequate computational approaches, gene identification will depend on the alignment of finished genomic sequence with sequences from experimentally validated transcripts. Following this approach, Dunham and colleagues (2) were able to identify 247 genes corresponding to fully sequenced transcripts on chromosome 22 that they have denominated
Abbreviations: EST, expressed sequence tag; ORESTES, ORF ESTs.