The location of a protein in a cell is closely correlated with its biological function. Based on the concept that protein subcellular location is mainly determined by amino acid and pseudo amino acid composition (PseAA), a new algorithm that combines the increment of diversity with a support vector machine is proposed to predict protein subcellular location. The subcellular locations of plant and non-plant proteins are investigated with this method. The overall prediction accuracies in the jackknife test are 88.3% for the eukaryotic plant proteins and 92.4% for the eukaryotic non-plant proteins. In order to estimate the effect of sequence identity on the predictive result, the proteins with sequence identity
Acer truncatum (purpleblow maple) is a woody tree species that produces seeds with high levels of valuable fatty acids (especially nervonic acid). However, the lack of a complete genome sequence has limited both basic and applied research on A. truncatum. We describe a high-quality draft genome assembly comprising 633.28 Mb (contig N50 = 773.17 kb; scaffold N50 = 46.36 Mb) with at least 28 438 predicted genes. The genome underwent an ancient triplication, similar to the core eudicots, but there have been no recent whole-genome duplication events. Acer yangbiense and A. truncatum are estimated to have diverged about 9.4 million years ago. A combined genomic, transcriptomic, metabonomic, and cell ultrastructural analysis provided new insights into the biosynthesis of very long-chain monounsaturated fatty acids. In addition, three KCS genes were found that may contribute to regulating nervonic acid biosynthesis. The KCS paralogous gene family expanded to 28 members, with 10 genes clustered together in a 0.27-Mb region of pseudochromosome 4. Our chromosome-scale genomic characterization may facilitate the discovery of agronomically important genes and stimulate functional genetic research on A. truncatum. Furthermore, the data presented offer an important foundation from which to study the molecular mechanisms influencing the production of nervonic acid.
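The contig and scaffold N50 values quoted above are summary statistics of assembly contiguity. As a minimal sketch, assuming the standard definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size), N50 can be computed as follows. The function name `n50` and the example lengths are hypothetical and are not taken from the A. truncatum assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumes the standard definition: sort lengths from longest to
    shortest and return the length at which the cumulative sum first
    reaches at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, lengths in kb (illustration only).
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
```

For the toy lengths the total is 290, so the running sum first reaches 145 at the second-longest sequence.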
Protein structures can be classified mainly into four classes, all-alpha, all-beta, alpha/beta, and alpha + beta, according to their chain fold topologies. To predict protein structural class, a new algorithm that combines the increment of diversity with quadratic discriminant analysis is presented. On the basis of the concept of pseudo amino acid composition (Chou, Proteins: Struct Funct Genet 2001, 43, 246; Erratum: Proteins Struct Funct Genet 2001, 44, 60), the 400 dipeptide components and the 20 amino acid components are, respectively, selected as parameters of the diversity source. A total of 204 nonhomologous proteins constructed by Chou (Chou, Biochem Biophys Res Commun 1999, 264, 216) are used for training and testing the predictive model. The results predicted with the pseudo amino acid approach proposed in this paper remarkably improve the success rates, and hence the current method may play a complementary role to other existing methods for predicting protein structural class.
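The abstract above does not state the formula for the increment of diversity. The sketch below assumes the definition commonly used in this line of work: for a count vector X with total N, the diversity is D(X) = N ln N - sum_i n_i ln n_i, and the increment of diversity between X and Y is ID(X, Y) = D(X + Y) - D(X) - D(Y), with smaller values indicating more similar sources. All function names and the example sequences are hypothetical, and the downstream quadratic discriminant step is not shown.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# The 400 dipeptide components used as the diversity source.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_counts(sequence):
    """Count overlapping dipeptides in a protein sequence (400 values)."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    return [counts.get(d, 0) for d in DIPEPTIDES]


def diversity(counts):
    """D(X) = N ln N - sum_i n_i ln n_i, skipping zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts if n > 0
    )


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) (assumed definition)."""
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)


# Hypothetical sequences, for illustration only.
query = dipeptide_counts("ACDEFGHIKLACDEFG")
reference = dipeptide_counts("ACDEFGHIKLMNPQRS")
score = increment_of_diversity(query, reference)
```

Under this definition a sequence compared with itself gives an increment of zero, while two sources with no dipeptides in common give a strictly positive value. In a classifier, one such score per class would typically serve as an input feature.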
Promoters are modular DNA structures containing complex regulatory elements required for gene transcription initiation. Hence, the identification of promoters using machine learning approaches is very important for improving genome annotation and understanding transcriptional regulation. In recent years, many methods have been proposed for the prediction of eukaryotic and prokaryotic promoters. However, the performance of these methods is still far from satisfactory. In this article, we develop a hybrid approach (called IPMD) that combines a position correlation score function and the increment of diversity with a modified Mahalanobis discriminant to predict eukaryotic and prokaryotic promoters. By applying the proposed method to Drosophila melanogaster, Homo sapiens, Caenorhabditis elegans, Escherichia coli, and Bacillus subtilis promoter sequences, we achieve sensitivities and specificities of 90.6 and 97.4% for D. melanogaster, 88.1 and 94.1% for H. sapiens, 83.3 and 95.2% for C. elegans, 84.9 and 91.4% for E. coli, and 80.4 and 91.3% for B. subtilis. The high accuracies indicate that IPMD is an efficient method for the identification of eukaryotic and prokaryotic promoters. This approach can also be extended to predict promoters in other species.
By using a composite vector of the increment of diversity, a position conservation scoring function, and predicted secondary structures to express the sequence information, a support vector machine (SVM) algorithm for predicting beta- and gamma-turns in proteins is proposed. The 426 and 320 nonhomologous protein chains described by Guruprasad and Rajkumar (Guruprasad and Rajkumar J Biosci 2000, 25, 143) are used for training and testing the predictive model of the beta- and gamma-turns, respectively. The overall prediction accuracy and the Matthews correlation coefficient in 7-fold cross-validation are 79.8% and 0.47, respectively, for the beta-turns. The overall prediction accuracy in 5-fold cross-validation is 61.0% for the gamma-turns. These results are significantly higher than those of other algorithms for the prediction of beta- and gamma-turns on the same datasets. In addition, the 547 and 823 nonhomologous protein chains described by Fuchs and Alix (Fuchs and Alix Proteins: Struct Funct Bioinform 2005, 59, 828) are used for training and testing the predictive model of the beta- and gamma-turns, and better results are obtained. This algorithm may help to improve the performance of protein turn prediction. To assess the ability of the SVM method to correctly classify beta-turn and non-beta-turn (gamma-turn and non-gamma-turn) residues, receiver operating characteristic curves, a threshold-independent measure, are provided.
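The overall accuracy and Matthews correlation coefficient (MCC) reported above are both derived from the counts of a binary confusion matrix. The sketch below uses the standard formulas; the function names and the example counts are hypothetical and do not reproduce the paper's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for turn versus non-turn residues.
tp, tn, fp, fn = 40, 40, 10, 10
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```

Unlike accuracy, MCC stays near zero for a predictor that labels every residue with the majority class, which is why it is often reported alongside accuracy on imbalanced data such as turn versus non-turn residues.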