CLIPS-1D: analysis of multiple sequence alignments to deduce for residue-positions a role in catalysis, ligand-binding, or protein structure

Janda, Jan-Oliver; Busch, Markus; Kück, Fabian; Porfenenko, Mikhail; Merkl, Rainer

doi:10.1186/1471-2105-13-55

Cited by 15 publications

(17 citation statements)

References 57 publications

(74 reference statements)


“…We assume a log-normal distribution because log transformed metabolomics data are usually approximately normally distributed (Karpievitch et al 2012). Further, using a log-normal distribution is consistent with typical analytical approaches to ‘omics data that log transform intensity values and then use a t -test, ANOVA, or linear regression which assume normally distributed data.…”

Section: Model Formulation and Methodology (mentioning)

confidence: 99%

“…In mass spectrometry ‘omics studies, however, missing values can originate from detection limit censoring and hence are MNAR. Because most imputation techniques produce unbiased results only if the missing data are MCAR or missing at random, but not MNAR (Karpievitch et al 2012, Lee 2004), using the imputation methods developed for microarray studies in mass spectrometry ‘omics studies could lead to biased results. Further, the choice of imputation method can substantially affect the results and interpretation of analyses of metabolomics data (Hrydziuszko and Viant 2012).…”

Section: Introduction (mentioning)

confidence: 99%


Accounting for undetected compounds in statistical analyses of mass spectrometry ‘omic studies

Taylor

Leiserowitz

Kim

2013

Statistical Applications in Genetics and Molecular Biology


Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Commonly, this data generated by mass spectrometry has many missing values resulting when a compound is absent from a sample or is present but at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare power and estimation of a mixture model to an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point mass proportions between groups. However, the AFT model yielded biased estimates with the bias increasing as the proportion of observations in the point mass increased while estimates were unbiased with the mixture model except if all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and mixture model for estimation. We demonstrated this approach through application to glycomics data of serum samples from women with ovarian cancer and matched controls.
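The difference between the two missing-data assumptions described in this abstract can be sketched as two log-likelihoods on log-transformed intensities. This is a minimal illustration, not the authors' implementation: the function names, the single shared detection limit `log_limit`, and the parameter `p_absent` (the point-mass proportion) are assumptions introduced here, and no covariates or group effects are modeled.

```python
import math

def norm_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def censored_loglik(log_values, n_missing, log_limit, mu, sigma):
    """Censoring-only assumption: every missing value lies below the limit."""
    ll = sum(norm_logpdf(v, mu, sigma) for v in log_values)
    ll += n_missing * math.log(norm_cdf(log_limit, mu, sigma))
    return ll

def mixture_loglik(log_values, n_missing, log_limit, mu, sigma, p_absent):
    """Mixture assumption: a value is missing because the compound is absent
    (probability p_absent) or because it is present but below the limit."""
    ll = sum(math.log(1.0 - p_absent) + norm_logpdf(v, mu, sigma)
             for v in log_values)
    p_miss = p_absent + (1.0 - p_absent) * norm_cdf(log_limit, mu, sigma)
    ll += n_missing * math.log(p_miss)
    return ll

# Hypothetical data: three detected intensities and two missing values.
observed = [math.log(v) for v in (5.0, 8.0, 12.0)]
print(censored_loglik(observed, 2, math.log(3.0), 2.0, 0.5))
print(mixture_loglik(observed, 2, math.log(3.0), 2.0, 0.5, 0.3))
```

With `p_absent` set to zero the mixture log-likelihood reduces to the censoring-only one, which is the boundary case the abstract singles out.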


“…Given these results, we investigated whether functional residue prediction programs specifically designed to identify catalytic, ligand-binding and subtype-specific residues yield similar results. S2.1A–S2.1D Fig compares our analysis of the Gna1 subgroup to that of two such programs: FRpred [43] and CLIPS-1D [45]. This reveals that the structural bipartitioning of residues is unique to our hiMSA analysis, which therefore, at least in this case, is finding protein structural features that these other methods fail to identify.…”

Section: Application (mentioning)

confidence: 85%

Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations

Neuwald

Altschul

2016

PLoS Comput Biol


Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes’ theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).


“…Since the 3D structures are often unavailable, Capra and Singh [34] developed a window score for such predictions. The concrete shape of our scores takes its pattern from Janda et al [45], who in turn refer to Fischer et al [33]. Our scores are convex combinations of the Jensen-Shannon terms associated with the residues belonging to the surrounding window w(k).…”

Section: Methods (mentioning)

confidence: 99%
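A window score of the kind quoted above, a convex combination of per-position terms over a window w(k), could look roughly like the following sketch. The weighting used here (weight `lam` on the central position, the remainder spread evenly over its neighbours) is one simple choice assumed for illustration; it is not necessarily the weighting used in the cited papers, and `window_scores`, `half_width`, and `lam` are hypothetical names.

```python
def window_scores(scores, half_width=3, lam=0.5):
    """Smooth per-position scores with a convex combination over a window.

    scores     -- per-position terms (for example Jensen-Shannon values)
    half_width -- number of neighbours taken on each side of position k
    lam        -- weight of the central position, between 0 and 1
    """
    n = len(scores)
    smoothed = []
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n, k + half_width + 1)
        neighbours = [scores[j] for j in range(lo, hi) if j != k]
        if neighbours:
            mean_neighbours = sum(neighbours) / len(neighbours)
            smoothed.append(lam * scores[k] + (1.0 - lam) * mean_neighbours)
        else:
            # A single-position sequence has no neighbours to average over.
            smoothed.append(scores[k])
    return smoothed

# Hypothetical per-position scores for a short stretch of an alignment.
print(window_scores([1.0, 0.0, 0.0], half_width=1, lam=0.5))
```

Because the weights are non-negative and sum to one, each smoothed value stays within the range of the input terms, so a window over Jensen-Shannon values remains on the same scale as the values themselves.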

A Novel Sequence-Based Feature for the Identification of DNA-Binding Sites in Proteins Using Jensen–Shannon Divergence

Dang

Meckbach

Tacke

et al. 2016

Entropy


Abstract: The knowledge of protein-DNA interactions is essential to fully understand the molecular activities of life. Many research groups have developed various tools which are either structure- or sequence-based approaches to predict the DNA-binding residues in proteins. The structure-based methods usually achieve good results, but require the knowledge of the 3D structure of protein; while sequence-based methods can be applied to high-throughput of proteins, but require good features. In this study, we present a new information theoretic feature derived from Jensen-Shannon Divergence (JSD) between amino acid distribution of a site and the background distribution of non-binding sites. Our new feature indicates the difference of a certain site from a non-binding site, thus it is informative for detecting binding sites in proteins. We conduct the study with a five-fold cross validation of 263 proteins utilizing the Random Forest classifier. We evaluate the functionality of our new features by combining them with other popular existing features such as position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). We notice that by adding our features, we can significantly boost the performance of Random Forest classifier, with a clear increment of sensitivity and Matthews correlation coefficient (MCC).
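The Jensen-Shannon divergence between the amino acid distribution of one alignment column and a background distribution can be computed as in the sketch below. It is an illustration under stated assumptions rather than the feature as implemented in the paper: the pseudocount, the handling of gaps (ignored), and the uniform `background` used in the example are choices made here, whereas the abstract describes a background estimated from non-binding sites.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column, pseudocount=1e-6):
    """Relative amino acid frequencies of one alignment column.

    Gap and non-standard characters are ignored; the pseudocount is an
    assumption added here so that no frequency is exactly zero.
    """
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits, between 0 and 1."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical background: uniform over the 20 amino acids.
background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
conserved = column_distribution("AAAAAAAA")
variable = column_distribution("ACDEFGHI")
print(jsd(conserved, background), jsd(variable, background))
```

A strictly conserved column differs more from a flat background than a variable one, so it receives the larger divergence.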

