As a first step in establishing a proteome database for maize, we have embarked on the identification of the leaf proteins resolved on two-dimensional (2-D) gels. When the proteins were visualized with colloidal Coomassie blue, we detected nearly 900 spots on gels with a pH 4-7 gradient and over 200 spots on gels with a pH 6-11 gradient. Peptide mass fingerprints for 300 protein spots were obtained with a matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer, and 149 protein spots were identified using the protein databases. We also searched the pdbEST databases to identify the leaf proteins and verified 66% of the protein spots that had been identified using the protein databases. Sixty-seven additional protein spots were identified from expressed sequence tags (ESTs). Many abundant leaf proteins are present in multiple spots. The functions of over 50% of the abundant leaf proteins are either unknown or hypothetical. Our results show that EST databases, in conjunction with peptide mass fingerprints, can be used to identify proteins from organisms with incomplete genome sequence information.
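The idea behind matching a peptide mass fingerprint against a sequence database can be illustrated with a minimal sketch. It is not the search software used in the study, and it makes several simplifying assumptions: trypsin as the protease, no missed cleavages, no modifications, singly protonated peptides, and candidate sequences that are already protein sequences (EST entries would first need to be translated). The function names, the 0.2 Da tolerance, and the toy sequences are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # MALDI ions are assumed to be singly protonated


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide):
    """Theoretical m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(peaks, protein, tol=0.2):
    """Number of observed peaks within tol Da of any theoretical peptide."""
    theoretical = [peptide_mz(p) for p in tryptic_digest(protein)]
    return sum(any(abs(peak - t) <= tol for t in theoretical) for peak in peaks)


def rank_candidates(peaks, candidates, tol=0.2):
    """Sort candidate sequences (name -> sequence) by matched-peak count."""
    scores = [(name, count_matches(peaks, seq, tol))
              for name, seq in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy example: the peaks are generated from candidate "a".
toy_candidates = {"a": "GKAAR", "b": "PPPPK"}
toy_peaks = [peptide_mz("GK"), peptide_mz("AAR")]
ranking = rank_candidates(toy_peaks, toy_candidates)
```

A raw count of matched peaks is the simplest possible score; real search tools weight matches statistically, so this sketch only conveys the principle.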
Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory-efficient algorithms that reduce the memory required to an amount linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software program widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists with a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
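The general shape of EST clustering can be shown with a small serial sketch. It does not reproduce PaCE's algorithms, which the abstract does not describe in detail. As a stand-in, it assumes a shared k-mer index to pick candidate pairs, an exact suffix-prefix overlap test in place of alignment, and a union-find structure to merge clusters. It ignores reverse complements, sequencing errors and parallelism, and all names and default parameters are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure that tracks which ESTs share a cluster."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def has_overlap(a, b, min_overlap):
    """True if a suffix of one read exactly equals a prefix of the other."""
    for x, y in ((a, b), (b, a)):
        for n in range(min(len(x), len(y)), min_overlap - 1, -1):
            if x[-n:] == y[:n]:
                return True
    return False


def cluster_ests(ests, k=8, min_overlap=20):
    """Group ESTs that are linked by overlaps; returns lists of indices."""
    # Index every k-mer so only reads sharing a k-mer are ever compared.
    index = defaultdict(set)
    for i, seq in enumerate(ests):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(i)

    clusters = UnionFind(len(ests))
    for ids in index.values():
        ids = sorted(ids)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                i, j = ids[x], ids[y]
                # Skip the overlap test when the pair is already merged.
                if clusters.find(i) == clusters.find(j):
                    continue
                if has_overlap(ests[i], ests[j], min_overlap):
                    clusters.union(i, j)

    groups = defaultdict(list)
    for i in range(len(ests)):
        groups[clusters.find(i)].append(i)
    return sorted(sorted(members) for members in groups.values())


# Toy reads: the first two share the 7-base overlap "TTGACCA".
toy_ests = ["ACGTACGTTTGACCA", "TTGACCAGGGTTTAC", "CCCCCCCCGGGGGGGG"]
toy_clusters = cluster_ests(toy_ests, k=4, min_overlap=6)
```

Skipping pairs that already belong to the same cluster is one simple way to cut comparison work without changing the result, which is the kind of saving the abstract alludes to in point (ii).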