PreDNA: accurate prediction of DNA-binding sites in proteins by integrating sequence and geometric structure information

Li, Tao; Li, Qian‐Zhong; Liu, Shuai; Fan, Guoliang; Zuo, Yongchun; Peng, Yong

doi:10.1093/bioinformatics/btt029

Cited by 43 publications

(48 citation statements)

References 37 publications


“…As there are a number of combination methods and concatenation methods, we only consider the state-of-the-art works for the respective groups. Consequently, Ma et al’s work using combination method [56] and Li et al’s work [32] using the concatenation methods are used for comparison. In Ma et al’s work, it used PSSM with four physicochemical properties including the lone electron pairs, hydrophobicity, side chain pKa value and molecular mass are combined to calculate the feature representation for residues.…”

Section: Results (mentioning)

confidence: 99%

“…The similarity between any two proteins in PDNA-62 is less than 25%. The second benchmarking dataset, PDNA-224, is a recently developed dataset for DNA-binding residue prediction [32], which contains 224 protein sequences. The 224 protein sequences are extracted from 224 protein-DNA complexes retrieved from PDB [31] by using the cut-off pair-wise sequence similarity of 25%.…”

Section: Methods (mentioning)

confidence: 99%


EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation

Zhou et al. 2017, BMC Bioinformatics


Background: Prediction of DNA-binding residue is important for understanding the protein-DNA recognition mechanism. Many computational methods have been proposed for the prediction, but most of them do not consider the relationships of evolutionary information between residues.

Results: In this paper, we first propose a novel residue encoding method, referred to as the Position Specific Score Matrix (PSSM) Relation Transformation (PSSM-RT), to encode residues by utilizing the relationships of evolutionary information between residues. PDNA-62 and PDNA-224 are used to evaluate PSSM-RT and two existing PSSM encoding methods by five-fold cross-validation. Performance evaluations indicate that PSSM-RT is more effective than previous methods. This validates the point that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction. An ensemble learning classifier (EL_PSSM-RT) is also proposed by combining ensemble learning model and PSSM-RT to better handle the imbalance between binding and non-binding residues in datasets. EL_PSSM-RT is evaluated by five-fold cross-validation using PDNA-62 and PDNA-224 as well as two independent datasets TS-72 and TS-61. Performance comparisons with existing predictors on the four datasets demonstrate that EL_PSSM-RT is the best-performing method among all the predicting methods with improvement between 0.02–0.07 for MCC, 4.18–21.47% for ST and 0.013–0.131 for AUC. Furthermore, we analyze the importance of the pair-relationships extracted by PSSM-RT and the results validate the usefulness of PSSM-RT for encoding DNA-binding residues.

Conclusions: We propose a novel prediction method for the prediction of DNA-binding residue with the inclusion of relationship of evolutionary information and ensemble learning. Performance evaluation shows that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction and ensemble learning can be used to address the data imbalance issue between binding and non-binding residues. A web service of EL_PSSM-RT (http://hlt.hitsz.edu.cn:8080/PSSM-RT_SVM/) is provided for free access to the biological research community.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1792-8) contains supplementary material, which is available to authorized users.
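The abstract above reports gains in MCC (Matthews correlation coefficient) but does not spell out the formula. As a minimal sketch, assuming the standard textbook definition of MCC from confusion-matrix counts (the function name and the zero-denominator convention are illustrative choices, not taken from the cited paper), the metric can be computed as follows:

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard MCC from confusion-matrix counts.

    Assumption: returns 0.0 when any marginal is empty (denominator is zero),
    a common convention; the cited paper does not state how it handles this.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced residue-level task
# (few binding residues, many non-binding residues).
score = matthews_corrcoef(tp=40, tn=900, fp=60, fn=20)
print(f"MCC = {score:.3f}")
```

Because MCC uses all four confusion-matrix cells, a predictor that labels every residue as non-binding scores 0 here rather than the high value that plain accuracy would give on imbalanced data.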



“…SVM is one of the most common machine learning algorithm used for development of several bioinformatics prediction methods [15], [26]–[33]. SVM takes a set of feature vector attributes along with their real output as input.…”

Section: Methods (mentioning)

confidence: 99%

Protein Sub-Nuclear Localization Prediction Using SVM and Pfam Domain Information

et al. 2014


The nucleus is the largest, highly organized organelle of eukaryotic cells. Within the nucleus exist a number of pseudo-compartments, which are not separated by any membrane, yet each of them contains only a specific set of proteins. Understanding protein sub-nuclear localization can hence be an important step towards understanding biological functions of the nucleus. Here we describe SubNucPred, a method we developed for predicting the sub-nuclear localization of proteins. This method predicts protein localization for 10 different sub-nuclear locations sequentially by combining the presence or absence of unique Pfam domains with an amino acid composition based SVM model. The prediction accuracy during leave-one-out cross-validation was 85.05% for centromeric proteins, 76.85% for chromosomal proteins, 81.27% for nuclear speckle proteins, 81.79% for nucleolar proteins, 79.37% for nuclear envelope proteins, 77.78% for nuclear matrix proteins, 76.98% for nucleoplasm proteins, 88.89% for nuclear pore complex proteins, 75.40% for PML body proteins and 83.33% for telomeric proteins. Comparison with other reported methods showed that SubNucPred performs better than existing methods. A web-server for predicting protein sub-nuclear localization named SubNucPred has been established at http://14.139.227.92/mkumar/subnucpred/. A standalone version of SubNucPred can also be downloaded from the web-server.


“…To further investigate the performance of JSD-based features proposed in this study, we analyzed two additional datasets, namely RBscore [2] and PreDNA datasets [37]. Although the RBscore and PreDNA datasets initially contain 381 and 224 DNA-binding proteins, respectively, we have eliminated a few proteins since they are either included in our training dataset or ineligible due to their MSAs.…”

Section: Feature (mentioning)

confidence: 99%

A Novel Sequence-Based Feature for the Identification of DNA-Binding Sites in Proteins Using Jensen–Shannon Divergence

Dang, Meckbach, Tacke, et al. 2016, Entropy


Abstract: The knowledge of protein-DNA interactions is essential to fully understand the molecular activities of life. Many research groups have developed various tools which are either structure- or sequence-based approaches to predict the DNA-binding residues in proteins. The structure-based methods usually achieve good results, but require the knowledge of the 3D structure of protein; while sequence-based methods can be applied to high-throughput of proteins, but require good features. In this study, we present a new information theoretic feature derived from Jensen-Shannon Divergence (JSD) between amino acid distribution of a site and the background distribution of non-binding sites. Our new feature indicates the difference of a certain site from a non-binding site, thus it is informative for detecting binding sites in proteins. We conduct the study with a five-fold cross validation of 263 proteins utilizing the Random Forest classifier. We evaluate the functionality of our new features by combining them with other popular existing features such as position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). We notice that by adding our features, we can significantly boost the performance of Random Forest classifier, with a clear increment of sensitivity and Matthews correlation coefficient (MCC).
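The abstract above describes a feature built from the Jensen-Shannon divergence between the amino acid distribution at a site and a background distribution of non-binding sites. The sketch below illustrates only the generic JSD computation, assuming base-2 logarithms, a small pseudocount, and unweighted residue counts from an alignment column; the helper names, the pseudocount value, and the toy columns are hypothetical and are not the cited authors' implementation.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column: str, pseudocount: float = 1e-6) -> list[float]:
    """Amino acid frequencies for one alignment column (gaps are ignored).

    The pseudocount is an illustrative assumption to avoid zero frequencies.
    """
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for residue in column:
        if residue in counts:
            counts[residue] += 1.0
    total = sum(counts.values())
    return [counts[aa] / total for aa in AMINO_ACIDS]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: list[float], q: list[float]) -> float:
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy example: a lysine/arginine-rich column versus a mixed background column.
site = column_distribution("KKRKRKKR")
background = column_distribution("ACDEFGHIKLMNPQRSTVWY")
print(f"JSD = {jensen_shannon_divergence(site, background):.3f}")
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so a larger score marks a site whose composition departs more strongly from the background.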

