Protein sequence analysis by incorporating modified chaos game and physicochemical properties into Chou's general pseudo amino acid composition

Xu, Chunrui; Sun, Dandan; Liu, Shenghui; Zhang, Yusen

doi:10.1016/j.jtbi.2016.06.034

Cited by 25 publications

(11 citation statements)

References 50 publications


“…We observe that the error rate when it comes to false classification of the species is close to zero and the result of our method is comparably good as the one obtained in ref. 51.…”

Section: Results (mentioning)

confidence: 99%

Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix

Zhang, Gutman, et al. (2017), Sci Rep

Self Cite


We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of 20 amino acids and the measure of graph energy. The method puts the emphasis on sequence order information and describes local dynamic distributions of sequences, from which one can get a characteristic B-vector. Afterwards, we apply the relative entropy to the B-vectors representing the sequences to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.
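The abstract above compares sequences by applying relative entropy to their B-vectors but does not spell out the formula. A minimal sketch of one common reading follows, assuming each vector is non-negative and is normalized to sum to 1 before the Kullback-Leibler divergence is taken. The function names, the `eps` floor, and the symmetrized variant are illustrative assumptions, not the authors' definitions.

```python
import math

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) of two non-negative vectors.

    Assumption: both vectors are rescaled to sum to 1 first, and zero
    entries of q are floored at eps to keep the logarithm finite.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("vectors must have a positive sum")
    total = 0.0
    for a, b in zip(p, q):
        a, b = a / sum_p, b / sum_q
        if a > 0:  # terms with a == 0 contribute nothing
            total += a * math.log(a / max(b, eps))
    return total

def symmetric_divergence(p, q):
    """Symmetrized form, so the dissimilarity does not depend on argument order."""
    return relative_entropy(p, q) + relative_entropy(q, p)
```

A symmetric score is convenient when the values feed a distance matrix for tree building, since plain relative entropy is not symmetric in its arguments.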


“…They have been widely used and proven to be effective in protein sequence analyses [25], structural classification [26–28], pattern recognition receptor prediction [29], and fold recognition [30]. Thus, we proposed a novel representation for a protein sequence based on the two features, i.e.…”

Section: Methods (mentioning)

confidence: 99%

An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids

Song, Yang, et al. (2016), PLoS ONE


In this paper, we have proposed a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset consisting of the ND5 protein sequences of 9 species, and the F10 and G11 datasets representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW in the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector.
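The 440-dimensional encoding described in this abstract can be illustrated with a short sketch. The abstract gives only the block sizes (400 + 20 + 20), so the definitions below are assumptions made for illustration: adjacent-pair frequencies stand in for the Pseudo-Markov transition probabilities, and the position ratio is taken as the share of the summed 1-based positions held by each amino acid. The names `encode` and `euclidean` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Encode a protein sequence as a 440-dimensional vector (400 + 20 + 20).

    Assumed definitions, not taken from the paper:
      transition[20*a + b] = frequency of residue a immediately followed by b
      content[a]           = fraction of residues equal to a
      position[a]          = sum of 1-based positions of a / sum of all positions
    """
    residues = [INDEX[c] for c in seq.upper() if c in INDEX]  # skip non-standard letters
    n = len(residues)
    if n < 2:
        raise ValueError("need at least two standard residues")
    transition = [0.0] * 400
    for a, b in zip(residues, residues[1:]):
        transition[20 * a + b] += 1.0 / (n - 1)
    content = [0.0] * 20
    position = [0.0] * 20
    position_total = n * (n + 1) / 2
    for k, a in enumerate(residues, start=1):
        content[a] += 1.0 / n
        position[a] += k / position_total
    return transition + content + position

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.dist(u, v)
```

Under these assumptions each of the three blocks sums to 1, so no single block dominates the distance purely because of sequence length.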


“…In the second example, we apply our method to analyze a data set consisting of 36 protein sequences of 5 different families: Globin (1eca, 5mbn, 1hlb, 1hlm, 1babA, 1babB, 1ithA, 1mba, 2hbg, 2lhb, 3sdhA, 1ash, 1flp, 1myt, 1lh2, 2vhbA, 2vhb), Alpha-Beta (1aa9, 1gnp, 6q21A, 1ct9A, 1qraA, 5p21), Tim-Barrel (6xia, 2mnr, 1chrA, 4enl), Beta (1cd8, 1ci5, 1qa9, 1cdb, 1neu, 1qfoA, 1hnf), and Alpha (1cnp, 1jhg) [20, 43–48]. After extracting features by the method DCGR and reducing the dimensionality using PCA, the Manhattan distance was used to calculate the distance matrix of the 36 protein sequences.…”

Section: Similarity Analysis Of 36 Protein Sequences (mentioning)

confidence: 99%

“…In order to illustrate the superiority of DCGR, we compared its performance with six other methods including ClustalW in [20, 43–47], and the phylogenetic trees constructed by the six methods have been shown in Additional file 1: Figures S2-S8. After comparison, DCGR showed best performance since most of the six methods erroneously clustered at least three proteins, especially for ClustalW, which erroneously clustered 5 proteins as reported in [43].…”

Section: Similarity Analysis Of 36 Protein Sequences (mentioning)

confidence: 99%

“…This data set contains 50 beta-globin protein sequences from 50 species studied in [46, 49–53], and the accession numbers have been shown in Additional file 1: Notes 1.2. After extracting features by the method DCGR and reducing the dimensionality using PCA, the Cosine distance was used to calculate the distance matrix of 50 beta-globin protein sequences, and the phylogenetic tree was also constructed in Fig.…”

Section: Similarity Analysis Of 50 Beta-globin Protein Sequences (mentioning)

confidence: 99%
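The statements above build distance matrices from feature vectors using the Manhattan distance (36-protein set) and the cosine distance (beta-globin set). As a rough illustration of that step only, the sketch below computes both metrics and a symmetric pairwise matrix. Feature extraction and PCA are not reproduced, and the function names are hypothetical.

```python
import math

def manhattan(u, v):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors, metric):
    """Symmetric pairwise distance matrix with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(vectors[i], vectors[j])
    return matrix
```

A matrix of this shape is what a tree-building routine such as UPGMA or neighbor joining would typically take as input.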


DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information

Zhang et al. (2019), BMC Bioinformatics


Background: Protein feature extraction plays an important role in the areas of similarity analysis of protein sequences and prediction of protein structures, functions and interactions. The feature extraction based on graphical representation is one of the most effective and efficient ways. However, most existing methods suffer limitations from their method design. Results: We introduce DCGR, a novel method for extracting features from protein sequences based on the chaos game representation, which is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images. Tested on five data sets, DCGR was significantly superior to the state-of-the-art feature extraction methods. Conclusion: The DCGR is practically powerful for extracting effective features from protein sequences, and therefore important in similarity analysis of protein sequences, study of protein-protein interactions and prediction of protein functions. It is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/Feature%20Extraction . Electronic supplementary material: The online version of this article (10.1186/s12859-019-2943-x) contains supplementary material, which is available to authorized users.
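For readers unfamiliar with the chaos game representation mentioned in this abstract, a generic sketch is given below. It is not DCGR: the abstract says DCGR arranges the construction according to physicochemical properties, whereas this sketch simply places the 20 amino acids in alphabetical order on a unit circle, moves a point a fixed fraction toward the vertex of each residue, and summarizes the point distribution as grid-cell frequencies. The vertex layout, the `ratio` and `bins` parameters, and the function names are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed layout: 20 vertices evenly spaced on the unit circle, alphabetical order.
VERTICES = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}

def cgr_points(seq, ratio=0.5):
    """Chaos game walk: start at the origin and, for each residue,
    move the current point a fraction `ratio` toward that residue's vertex."""
    x, y = 0.0, 0.0
    points = []
    for c in seq.upper():
        if c not in VERTICES:  # skip non-standard letters
            continue
        vx, vy = VERTICES[c]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points

def grid_features(points, bins=4):
    """Fraction of CGR points in each cell of a bins x bins grid over [-1, 1]^2."""
    counts = [0] * (bins * bins)
    for x, y in points:
        col = max(0, min(int((x + 1) / 2 * bins), bins - 1))
        row = max(0, min(int((y + 1) / 2 * bins), bins - 1))
        counts[row * bins + col] += 1
    n = len(points) or 1  # avoid division by zero for an empty walk
    return [c / n for c in counts]
```

The grid frequencies form a fixed-length vector regardless of sequence length, which is what lets sequences of different lengths be compared with an ordinary vector distance.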
