Efficient clustering of large EST data sets on parallel computers

Kalyanaraman, Anantharaman; Aluru, Srinivas; Kothari, Suresh; Brendel, Volker

doi:10.1093/nar/gkg379

Cited by 76 publications

(51 citation statements)

References 15 publications

(17 reference statements)


“…In contrast, some of the largest species-specific EST collections are from plants, including wheat (Triticum aestivum; more than 415,000), barley (Hordeum vulgare; more than 310,000), soybean (Glycine max; more than 305,000), maize (Zea mays; more than 195,000), and Medicago truncatula (more than 180,000; http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). Kalyanaraman et al (2003) present a novel algorithm and software program (PaCE) to cluster large sets of ESTs into contigs that represent distinct gene fragments and its application to 22 plant species EST sets. Our motivation for the mapping of Arabidopsis ESTs onto the Arabidopsis genome was in part derived from the need for a confirmed standard of proven EST clusters against which to gauge the success of EST clustering programs that do not incorporate genome sequence data.…”

Section: Discussion

mentioning

confidence: 99%

“…Challenges of EST clustering arise from poor average sequence quality, incomplete EST sampling, polymorphisms, alternative transcript isoforms, representation of highly similar transcripts from distinct members of multigene families, and cloning artifacts. Different strategies for EST clustering and the associated gene indexing databases have been reviewed by Bouck et al (1999); for a recent method for EST clustering on parallel computers, see Kalyanaraman et al (2003). For Arabidopsis, up-to-date EST clusters are available in form of the UniGene clusters at NCBI (http://www.ncbi.nlm.nih.gov/UniGene/) and as a The Institute for Genome Research (TIGR) Gene Index (AtGI; http://www.tigr.org/tdb/tgi/agi/; Quackenbush et al, 2001). The current UniGene build (no.…”

mentioning

confidence: 99%


Refined Annotation of the Arabidopsis Genome by Complete Expressed Sequence Tag Mapping

2003


Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.

The efforts of an international collaboration to obtain the complete genome sequence of the flowering plant Arabidopsis resulted in the release and annotation of 115.4 Mb of the genome (estimated at 125 Mb) in December of 2000 (Arabidopsis Genome Initiative, 2000). At that time, 25,498 protein-coding genes were identified in the five haploid chromosomes, but only 9% of these genes had been characterized experimentally, and only 69% could be functionally classified by similarity to proteins of known functions. In the interim, sequencing and annotation has progressed. The most current release of the Arabidopsis genome available at GenBank provides 117.3 Mb and 27,288 annotated protein-coding genes (see Data Sets in "Materials and Methods"). Annotation of the Arabidopsis genome and functional characterization of all the genes is an ongoing effort.
Initial, high-throughput computational gene structure prediction has likely been successful in identifying most gene locations; however, these methods still suffer from limitations in predicting the precise gene structure for an entire gene, detection of intergenic regions, and identification of non-coding exon sequences (Pavy et al., 1999; Brendel and Zhu, 2002). Recent studies have concentrated on sequencing of full-length cDNAs to improve genome annotation (Seki et al., 2002).

Expressed sequence tags (ESTs) are single-pass sequencing reads of cDNA clones that have become a widely employed method for gene identification, expression profiling, and polymorphism analysis. Presently, more than 13.4 million EST entries have been deposited into the National Center for Biotechnology Information (NCBI) dbEST public database, including Arabidopsis with 176,915 ESTs and 21 other species with EST sets of more than 100,000 entries (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). In the absence of a whole-genome sequencing project for a particular species, clustering of ESTs into contigs that represent unique genes is one of the most promi...


“…Vmatch program [32] was used to identify contaminations and repetitive elements by comparison of the mRNA sequences to vector, bacterial and repeat databases. Cleaned EST sequences were first clustered by the PaCE program [33] and then for each clusters, clustering algorithm (CAP3) [34] is used to perform the assembly. In order to minimize such potential false negatives, the above resulted CAP3 contigs/singlets are self-clustered using the Vmatch program.…”

Section: Sequence Data

mentioning

confidence: 99%

Genome‐wide analysis of alternative splicing events in Hordeum vulgare: Highlighting retention of intron‐based splicing and its possible function through network analysis

et al. 2015


In this study, using homology mapping of assembled expressed sequence tags against the genomic data, we identified alternative splicing events in barley. Results demonstrated that intron retention is frequently associated with specific abiotic stresses. Network analysis resulted in discovery of some specific sub-networks between miRNAs and transcription factors in genes with high number of alternative splicing, such as cross talk between SPL2, SPL10 and SPL11 regulated by miR156 and miR157 families. To confirm the alternative splicing events, elongation factor protein (MLOC_3412) was selected followed by experimental verification of the predicted splice variants by Semi quantitative Reverse Transcription PCR (qRT-PCR). Our novel integrative approach opens a new avenue for functional annotation of alternative splicing through regulatory-based network discovery.


“…These ESTs were then clustered using PaCE (Kalyanaraman et al 2003) under default parameters, and contigs were generated using CAP3 from each resulting cluster as previously described. Polymorphic sites with representation in ≥25% of participating ESTs, which also violated random expectation for sequencing errors (P < 0.01), were selected; 28 primer pairs were designed to flank the 24 previously unreported duplications using Primer3.…”

Section: Methods

mentioning

confidence: 99%

Nearly Identical Paralogs: Implications for Maize (Zea mays L.) Genome Evolution

Emrich, Wen, et al. 2007

Genetics


As an ancient segmental tetraploid, the maize (Zea mays L.) genome contains large numbers of paralogs that are expected to have diverged by a minimum of 10% over time. Nearly identical paralogs (NIPs) are defined as paralogous genes that exhibit ≥98% identity. Sequence analyses of the "gene space" of the maize inbred line B73 genome, coupled with wet lab validation, have revealed that, conservatively, at least ~1% of maize genes have a NIP, a rate substantially higher than that in Arabidopsis. In most instances, both members of maize NIP pairs are expressed and are therefore at least potentially functional. Of evolutionary significance, members of many NIP families also exhibit differential expression. The finding that some families of maize NIPs are closely linked genetically while others are genetically unlinked is consistent with multiple modes of origin. NIPs provide a mechanism for the maize genome to circumvent the inherent limitation that diploid genomes can carry at most two "alleles" per "locus." As such, NIPs may have played important roles during the evolution and domestication of maize and may contribute to the success of long-term selection experiments in this important crop species.
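The citation statement for this paper describes selecting polymorphic sites that are represented in at least 25% of the participating ESTs and that are unlikely to arise from sequencing error alone (P < 0.01). The following is a minimal sketch of how such a filter might look, not the authors' implementation. It assumes a one-sided binomial test, which the quoted text does not specify, and a hypothetical per-base error rate of 1%; the function names `binomial_tail` and `is_candidate_site` are illustrative only.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_candidate_site(depth, variant_count, error_rate=0.01,
                      min_fraction=0.25, alpha=0.01):
    """Decide whether an alignment column passes both filters.

    depth         -- number of ESTs covering the column
    variant_count -- number of those ESTs carrying the variant base
    error_rate    -- assumed per-base sequencing error rate (hypothetical)
    min_fraction  -- minimum share of ESTs that must carry the variant
    alpha         -- significance threshold for the error-only model
    """
    if depth == 0:
        return False
    # Filter 1: the variant must be represented in enough ESTs.
    if variant_count / depth < min_fraction:
        return False
    # Filter 2: seeing this many variant reads must be improbable
    # if every variant read were just a sequencing error.
    return binomial_tail(depth, variant_count, error_rate) < alpha
```

Under these assumptions, a column covered by 8 ESTs with 2 variant reads passes both filters, while a column covered by 4 ESTs with a single variant read meets the 25% threshold but fails the significance test.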
