Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences

Wang, Wei; Sun, Lin; Zhang, Shiguang; Zhang, Hongjun; Shi, Jinling; Xu, Tianhe; Li, Keliang

doi:10.1186/s12859-017-1715-8

Cited by 13 publications

(22 citation statements)

References 56 publications

“…A most recent work predicted DNA-binding proteins interacting with ssDNA (single-stranded DNA) or dsDNA (double-stranded DNA) using OAAC (overall amino acid composition) features, dipeptide compositions, PSSM (position-specific scoring matrix profiles) and split amino acid composition (SAA) [33]. Testing by SVM (support vector machine) and RF (random forest) classification model, their method can achieve the accuracy of 88.7% and AUC of 0.919.…”

Section: Discussion (mentioning)

confidence: 99%
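
The two composition features named in the citation statement above (overall amino acid composition and dipeptide composition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the function names are hypothetical, and skipping non-standard residue codes is a choice made here for simplicity.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return 20 values: the fraction of each standard residue in seq.

    Assumption: non-standard codes (e.g. X, B, Z) are ignored.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: the fraction of each adjacent residue pair in seq.

    Assumption: a pair is counted only if both residues are standard.
    """
    s = seq.upper()
    pairs = [
        s[i:i + 2]
        for i in range(len(s) - 1)
        if s[i] in AMINO_ACIDS and s[i + 1] in AMINO_ACIDS
    ]
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    total = len(pairs)
    return [
        counts.get(a + b, 0) / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]
```

Both functions map a sequence of any length to a fixed-length vector (20 and 400 values), which is what lets classifiers such as SVM or random forest consume proteins of different lengths.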

On the prediction of DNA-binding proteins only from primary sequences: A deep learning approach

Gong et al. 2017, PLoS ONE

DNA-binding proteins play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions in both eukaryotic and prokaryotic proteomes. Predicting the functions of these proteins from primary amino acid sequences is becoming one of the major challenges in the functional annotation of genomes. Traditional prediction methods often focus on extracting physicochemical features from sequences while ignoring motif information and the positional relationships between motifs. Meanwhile, small data volumes and heavy noise in the training data lower the accuracy and reliability of predictions. In this paper, we propose a deep-learning-based method to identify DNA-binding proteins from primary sequences alone. It uses two stages of convolutional neural networks to detect the functional domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and binary cross-entropy to evaluate the quality of the neural networks. When the proposed method is tested on a realistic DNA-binding protein dataset, it achieves a prediction accuracy of 94.2% at a Matthews correlation coefficient of 0.961. Compared with LibSVM on the Arabidopsis and yeast datasets in independent tests, the accuracy rises by 9% and 4%, respectively. Comparative experiments using different feature extraction methods show that our model achieves accuracy similar to the best of the others, while its sensitivity, specificity and AUC increase by 27.83%, 1.31% and 16.21%, respectively. These results suggest that our method is a promising tool for identifying DNA-binding proteins.
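
The evaluation measures quoted in this abstract (accuracy, sensitivity, specificity and the Matthews correlation coefficient) all derive from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical and any counts passed to it are the caller's own, not figures from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives.
    Ratios with an empty denominator are reported as 0.0 (a convention chosen here).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts symmetrically, so it remains informative when the binding and non-binding classes are unbalanced.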

“…There are some results showing that sequence-based calculation methods are of great use to predict binding sites (Wang et al., 2017; Wang et al., 2019c). The evolutionary information of the protein sequence is encoded by the position-specific scoring matrix (PSSM).…”

Section: Methods (mentioning)

confidence: 99%

Prediction of DNA-Binding Protein–Drug-Binding Sites Using Residue Interaction Networks and Sequence Feature

Wang, Liu, Zhang et al. 2022, Front. Bioeng. Biotechnol. (self-citation)

Identification of protein–ligand binding sites plays a critical role in drug discovery. However, there is still a lack of targeted drug prediction for DNA-binding proteins. This study targets the binding sites between DNA-binding proteins and drugs by mining residue interaction network features, which describe the local and global structure of amino acids, in combination with sequence features. The predictor of DNA-binding protein–drug binding sites is built with the Extreme Gradient Boosting (XGBoost) model and random under-sampling. We found that the residue interaction network features better characterize DNA-binding proteins, and that binding sites with high betweenness and high closeness values are more likely to interact with drugs. The model shows that the residue interaction network features can serve as an important quantitative indicator of drug-binding sites, and the method achieves high predictive performance for DNA-binding protein–drug binding sites. This study will help drug discovery research for DNA-binding proteins.
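
To make the closeness value mentioned in this abstract concrete, the sketch below computes closeness centrality on a small unweighted, undirected graph using breadth-first search. It is an illustration only: the graph representation and function name are assumptions, the paper's own network construction is not reproduced, and betweenness is omitted for brevity.

```python
from collections import deque


def closeness_centrality(graph):
    """Closeness of each node in an unweighted, undirected graph.

    graph: dict mapping a node to the set of its neighbours
           (hypothetical representation of a residue interaction network).
    Closeness = (number of reachable nodes) / (sum of shortest-path distances).
    Assumption: for disconnected graphs only the node's own component is used.
    """
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest-path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        reachable = len(dist) - 1
        scores[source] = reachable / total if total else 0.0
    return scores
```

On a three-node chain such as `{"R1": {"R2"}, "R2": {"R1", "R3"}, "R3": {"R2"}}` (residue labels invented for the example), the middle node scores highest because it is, on average, closest to every other node.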

“…For sequence-based feature calculation, we extracted 8833 DNA-binding proteins. These include 2136 DSBs and 339 SSBs obtained from Wang et al. [37]; the remainder were collected from UniProtKB/Swiss-Prot (www.uniprot.org). To eliminate redundancy, CD-HIT was used to remove proteins with a sequence similarity > 70% [40].…”

Section: Datasets (mentioning)

confidence: 99%

“…Because the gap between available sequences and structures of DNA binding proteins in UniProtKB/Swiss-Prot (www.uniprot.org) and the PDB (www.rcsb.org/pdb/) has been growing exponentially, structure-based methods can no longer meet the needs of high-throughput research [35,36]. Subsequently, Wei Wang et al [37] developed a machine learning method (Wang, 2017) with only single sequence information such as overall amino acid composition (OAAC) features, dipeptide compositions, and position-specific scoring matrix profiles (PSSMs). The results showed an accuracy of 88.7% and an AUC (area under the curve) of 0.919 on the benchmark datasets.…”

Section: Introduction (mentioning)

confidence: 99%

PredPSD: A Gradient Tree Boosting Approach for Single-Stranded and Double-Stranded DNA Binding Protein Prediction

et al. 2019

Interactions between proteins and DNA play essential roles in many biological processes. DNA-binding proteins can be classified into two categories. Double-stranded DNA-binding proteins (DSBs) bind to double-stranded DNA and are involved in a series of cell functions such as gene expression and regulation. Single-stranded DNA-binding proteins (SSBs) bind to single-stranded DNA and are necessary for DNA replication, recombination, and repair. Therefore, the effective classification of DNA-binding proteins is helpful for the functional annotation of proteins. In this work, we propose PredPSD, a computational method based on sequence information that accurately predicts SSBs and DSBs. It introduces three novel feature extraction algorithms. In particular, we use the auto-cross covariance (ACC) transformation to transform feature matrices into fixed-length vectors. Then, we feed the optimal feature subset obtained by the minimal-redundancy-maximal-relevance (mRMR) feature selection algorithm into gradient tree boosting (GTB). In 10-fold cross-validation on a benchmark dataset, PredPSD achieves an AUC of 0.956 and an accuracy of 0.912, which are better than those of existing methods. Moreover, our method significantly improves the prediction accuracy in independent testing. The experimental results show that PredPSD can effectively recognize binding specificity and differentiate DSBs from SSBs.
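
The idea of turning a per-residue feature matrix into a fixed-length vector can be illustrated with the auto-covariance half of the ACC transformation. The sketch below uses the commonly cited definition (the mean-centred product of a column with itself at a given lag, averaged over positions); whether PredPSD uses exactly this form is an assumption, the function name is hypothetical, and the cross-covariance terms between different columns are left out.

```python
def auto_covariance(matrix, max_lag):
    """Auto-covariance features of a per-residue feature matrix.

    matrix:  list of rows, one per residue, each with the same number of
             columns (e.g. a PSSM with 20 columns).
    max_lag: largest lag to use; must be smaller than the number of rows.
    Returns max_lag * n_cols values, independent of sequence length.
    """
    length = len(matrix)
    if length == 0 or max_lag >= length:
        raise ValueError("max_lag must be smaller than the number of rows")
    n_cols = len(matrix[0])
    # Column means over the whole sequence.
    means = [sum(row[c] for row in matrix) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for c in range(n_cols):
            total = sum(
                (matrix[j][c] - means[c]) * (matrix[j + lag][c] - means[c])
                for j in range(length - lag)
            )
            features.append(total / (length - lag))
    return features
```

Because the output length depends only on `max_lag` and the number of columns, proteins of different lengths yield vectors of the same size, which is the property a feature selector and classifier need.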
