Background
Cheminformaticians are equipped with a rich toolbox for molecular similarity calculations. A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of molecular representations. In this work, eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints with sum of ranking differences (SRD) and ANOVA analysis. The effects of molecular size, selection methods and data pretreatment methods on the outcome of the comparison are also assessed.

Results
A supplier database (https://mcule.com/) was used as the source of compounds for the similarity calculations in this study. A large number of datasets, each consisting of one hundred compounds, were compiled; molecular fingerprints were generated, and similarity values between a randomly chosen reference compound and the rest were calculated for each dataset. Similarity metrics were compared based on their ranking of the compounds within one experiment (one dataset) using sum of ranking differences (SRD), while the results of the entire set of experiments were summarized on box-and-whisker plots. Finally, the effects of various factors (data pretreatment, molecule size, selection method) were evaluated with analysis of variance (ANOVA).

Conclusions
This study complements previous efforts to examine and rank various metrics for molecular similarity calculations. Here, however, an entirely general approach was taken to neglect any a priori knowledge on the compounds involved, as well as any bias introduced by examining only one or a few specific scenarios. The Tanimoto index, Dice index, Cosine coefficient and Soergel distance were identified as the best (and in some sense equivalent) metrics for similarity calculations, i.e. these metrics produced the rankings closest to the composite (average) ranking of the eight metrics.
The similarity metrics derived from Euclidean and Manhattan distances are not recommended on their own, although their variability and divergence from the other similarity metrics might be advantageous in certain cases (e.g. for data fusion). Conclusions are also drawn regarding the effects of molecule size, selection method and data pretreatment on the ranking behavior of the studied metrics.

Graphical Abstract
A visual summary of the comparison of similarity metrics with sum of ranking differences (SRD).

Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0069-3) contains supplementary material, which is available to authorized users.
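The four metrics singled out above have simple closed forms over binary fingerprints. As an illustrative sketch (not the study's own code), the Python functions below compute them from the usual bit counts: a and b on-bits in each fingerprint and c shared on-bits. The function names and the list-of-0/1 fingerprint representation are assumptions made for this example.

```python
from math import sqrt

def count_bits(fp1, fp2):
    # fp1, fp2: equal-length sequences of 0/1 bits
    a = sum(fp1)                                  # on-bits in molecule 1
    b = sum(fp2)                                  # on-bits in molecule 2
    c = sum(x & y for x, y in zip(fp1, fp2))      # shared on-bits
    return a, b, c

def tanimoto(fp1, fp2):
    a, b, c = count_bits(fp1, fp2)
    return c / (a + b - c)

def dice(fp1, fp2):
    a, b, c = count_bits(fp1, fp2)
    return 2 * c / (a + b)

def cosine(fp1, fp2):
    a, b, c = count_bits(fp1, fp2)
    return c / sqrt(a * b)

def soergel(fp1, fp2):
    # for binary fingerprints, the Soergel distance equals 1 - Tanimoto,
    # which is one sense in which these metrics are "equivalent"
    return 1.0 - tanimoto(fp1, fp2)
```

Because Soergel is a monotone transform of Tanimoto for binary data, the two always produce identical rankings, consistent with the equivalence noted above.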
Background
Interaction fingerprints (IFPs) have repeatedly been shown to be valuable tools in virtual screening for identifying novel hit compounds that can subsequently be optimized into drug candidates. As a complementary method to ligand docking, IFPs can be applied to quantify the similarity of predicted binding poses to a reference binding pose. For this purpose, a large number of similarity metrics can be applied, and various parameters of the IFPs themselves can be customized. In a large-scale comparison, we have assessed the effect of similarity metrics and IFP configurations in a number of virtual screening scenarios with ten different protein targets and thousands of molecules. In particular, the effect of general interaction definitions (such as Any Contact, Backbone Interaction and Sidechain Interaction), the effect of filtering methods and the different groups of similarity metrics were studied.

Results
The performances were primarily compared based on AUC values, but we have also used the original similarity data for the comparison of similarity metrics with several statistical tests and the novel, robust sum of ranking differences (SRD) algorithm. With SRD, we can evaluate the consistency (or concordance) of the various similarity metrics with an ideal reference metric, which is provided by data fusion from the existing metrics. Different aspects of IFP configurations and similarity metrics were examined based on SRD values with analysis of variance (ANOVA) tests.

Conclusion
A general approach is provided that can be applied for the reliable interpretation and usage of similarity measures with interaction fingerprints. Metrics that are viable alternatives to the commonly used Tanimoto coefficient were identified based on a comparison with an ideal reference metric (consensus).
A careful selection of the applied bits (interaction definitions) and IFP filtering rules can improve the results of virtual screening (in terms of their agreement with the consensus metric). The open-source Python package FPKit was introduced for the similarity calculations and IFP filtering; it is available at: https://github.com/davidbajusz/fpkit.

Electronic supplementary material
The online version of this article (10.1186/s13321-018-0302-y) contains supplementary material, which is available to authorized users.
Datasets applied in typical quantitative structure-activity/property relationship (QSAR/QSPR) and classification modeling can vary from a few hundred to thousands of samples. However, the size of the dataset and the train/test split ratio can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, and then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and, to a lesser extent, the train/test split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling. The performance parameters reacted differently to changes in sample set size; some of them were much more sensitive to this factor than others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.
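The study above used factorial ANOVA over several crossed factors; as a minimal, simplified illustration of the underlying idea, the sketch below computes a one-way ANOVA F statistic by hand, e.g. for one performance parameter measured on models built at different dataset sizes. This is a textbook formula, not the study's actual multi-factor analysis.

```python
def one_way_anova_f(groups):
    # groups: list of lists of measurements, one list per factor level
    # (e.g. one performance parameter at each dataset size)
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # between-group sum of squares: spread of group means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: spread of values around their own group mean
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) indicates that the factor, such as dataset size, explains more variance than chance would.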
Autophagy functions as a main route for the degradation of superfluous and damaged constituents of the cytoplasm. Defects in autophagy are implicated in the development of various age-dependent degenerative disorders such as cancer, neurodegeneration and tissue atrophy, and in accelerated aging. To promote basal levels of the process in pathological settings, we previously screened a small molecule library for novel autophagy-enhancing factors that inhibit the myotubularin-related phosphatase MTMR14/Jumpy, a negative regulator of autophagic membrane formation. Here we identify AUTEN-99 (autophagy enhancer-99), which activates autophagy in cell cultures and animal models. AUTEN-99 appears to effectively penetrate through the blood-brain barrier, and impedes the progression of neurodegenerative symptoms in Drosophila models of Parkinson’s and Huntington’s diseases. Furthermore, the molecule increases the survival of isolated neurons under normal and oxidative stress-induced conditions. Thus, AUTEN-99 serves as a potent neuroprotective drug candidate for preventing and treating diverse neurodegenerative pathologies, and may promote healthy aging.
Asymmetrical dimethylarginine (ADMA) is thought to be an endogenous regulator of arteriolar tone by inhibiting NO synthase. However, our previous studies showed that, in isolated arterioles, ADMA induced superoxide production as well. Thus, the mechanisms by which ADMA affects arteriolar tone remain obscure. We hypothesized that ADMA, by activating NAD(P)H oxidase, increases superoxide production, interfering with NO mediation of flow-induced dilation. In the presence of indomethacin, isolated arterioles from rat gracilis muscle (≈160 μm at 80 mm Hg) were incubated with ADMA (10⁻⁴ mol/L), which elicited significant constriction (from 162±4 to 143±4 μm) and eliminated the dilations to increases in intraluminal flow (from a maximum of 31±2% to 3±1%; P<0.05). In the presence of ADMA, superoxide dismutase plus catalase restored dilations to flow (from a maximum of 3±1% to 28±2%). Endothelial denudation or incubation of arterioles with the NAD(P)H oxidase inhibitor apocynin or the angiotensin-converting enzyme inhibitor quinapril inhibited ADMA-induced constriction. In addition, apocynin, quinapril, or the angiotensin type 1 receptor blocker losartan restored flow-induced dilations reduced by ADMA. Furthermore, inhibition of NO synthase abolished the "superoxide dismutase/catalase-restored" flow-induced dilation in the presence of ADMA. ADMA-induced increased production of superoxide, assessed by dihydroethidium fluorescence, was inhibited by apocynin, quinapril, or losartan. We suggest that ADMA activates the local renin-angiotensin system, and the released angiotensin II activates NAD(P)H oxidase; the superoxide produced interferes with the bioavailability of NO, resulting in diminished flow-induced dilation, a mechanism that may contribute to the development of arteriolar dysfunction and increased tone associated with elevated ADMA levels.
Key Words: ADMA, regional blood flow, flow-dependent dilation, NO, oxidative stress, ACE

Asymmetrical dimethylarginine (ADMA) is a naturally occurring L-arginine analogue derived from the proteolysis of proteins containing methylated arginine residues. 1-3 By now, numerous studies suggest that an elevated plasma level of ADMA is associated with endothelial dysfunction and is a risk factor for several human diseases, 4 such as hyperhomocysteinemia, 5 hypertension, 6 coronary artery disease, 7 peripheral arterial occlusive disease, 8 pulmonary hypertension, 9 and preeclampsia. 10 In our previous studies in isolated arterioles we have found that elevated levels of exogenous ADMA impair the regulation of arteriolar resistance by interfering with the NO mediation of flow/shear stress-induced dilation. 11 Previous studies have found that ADMA inhibits purified NO synthase (NOS) catalytic activity and, thus, the release of NO and NO-mediated vascular responses. 12,13 In addition, however, we have also found that ADMA elicits the release of reactive oxygen species, primarily superoxide, because superoxide dismutase reversed the ADMA-elicited reduction in basal diameter and ethidium bromide...
Recent implementations of QSAR modeling software provide the user with numerous models and a wealth of information. In this work, we provide some guidance on how one should interpret the results of QSAR modeling, compare and assess the resulting models, and select the best and most consistent ones. Two QSAR datasets are applied as case studies for the comparison of model performance parameters and model selection methods. We demonstrate the capabilities of sum of ranking differences (SRD) in model selection and ranking, and identify the best performance indicators and models. While exchanging the original training and (external) test sets does not affect the ranking of performance parameters, it provides improved models in certain cases (despite the lower number of molecules in the training set). Performance parameters for external validation are substantially separated from the other merits in SRD analyses, highlighting their value in data fusion.

Introduction
Model comparison and selection of the best model is an evergreen topic among scientific investigations. The process is full of contradictions: the bias-variance trade-off, local minima, the search for robust models, the principle of parsimony, etc.; all of these ideas inherently involve comparing various models. One model is better from one point of view, while another is better from a different point of view. Even if one fixes the aim (and algorithm) according to various criteria (R², Q², Mallows' Cp, the Akaike information criterion, the Bayesian information criterion, etc.), their application on the training, validation and test sets will necessarily provide different models for the description of existing data and for the prediction of future samples. The situation is complicated further by the fact that we deal with random effects: it is relatively easy to find conditions under which one of the models is clearly superior to the others. Many authors select, instinctively or deliberately, such datasets, splits, etc.
for which their own descriptor selection or model building algorithm performs better than the rival approaches. Kalivas et al. suggested selecting harmonious models taking into account the bias-variance trade-off: finding the 'best' model is difficult and not unambiguous. A biased model provides less variance and vice versa. However, harmonious models are not necessarily parsimonious [1]. The scope of the methodology has recently been extended with the idea of sum of ranking differences (SRD) for partial least squares and ridge regression models [2]. Principal component analysis (PCA) has been applied by Geladi [3,4] and Todeschini et al. [5] to find the best and worst regression and classification models, respectively. PCAs were completed on a matrix of regression vectors, and dominant patterns (grouping, outliers) could be detected among the models. The interpretation of the PCA results is easy: principal component 1 marks the direction of the best and worst regression models, while principal component 2 reflects the different behaviors of the regression models on different datasets. The models lyin...
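The SRD procedure referred to throughout these abstracts is simple enough to sketch directly: rank the objects by each method (column) and by a reference ranking (often the row-wise average, i.e. a consensus), then sum the absolute rank differences per method. The sketch below is a bare-bones illustration of that idea; it ignores tied ranks and the validation steps (randomization test, cross-validation) used in the published SRD methodology.

```python
def ranks(values):
    # rank positions (1 = smallest value); ties are not handled
    # in this simple sketch
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def srd(column, reference):
    # sum of absolute rank differences between one method's ranking
    # and the reference ranking
    return sum(abs(x - y) for x, y in zip(ranks(column), ranks(reference)))

# toy data: rows = objects (e.g. models), columns = methods (e.g. merits)
data = [[0.2, 0.8],
        [0.9, 0.3],
        [0.5, 0.6]]
reference = [sum(row) / len(row) for row in data]  # consensus: row average
srd_values = [srd([row[j] for row in data], reference) for j in range(2)]
# srd_values == [0, 4]: method 0 reproduces the consensus ranking exactly,
# method 1 deviates from it
```

A method with SRD = 0 agrees perfectly with the consensus; larger values mean larger deviation, which is how the "closest to the composite ranking" conclusions above are quantified.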
Quantification of the similarity of objects is a key concept in many areas of computational science. This includes cheminformatics, where molecular similarity is usually quantified based on binary fingerprints. While there is a wide selection of available molecular representations and similarity metrics, there were no previous efforts to extend the computational framework of similarity calculations to the simultaneous comparison of more than two objects (molecules) at the same time. The present study bridges this gap, by introducing a straightforward computational framework for comparing multiple objects at the same time and providing extended formulas for as many similarity metrics as possible. In the binary case (i.e. when comparing two molecules pairwise) these are naturally reduced to their well-known formulas. We provide a detailed analysis on the effects of various parameters on the similarity values calculated by the extended formulas. The extended similarity indices are entirely general and do not depend on the fingerprints used. Two types of variance analysis (ANOVA) help to understand the main features of the indices: (i) ANOVA of mean similarity indices; (ii) ANOVA of sum of ranking differences (SRD). Practical aspects and applications of the extended similarity indices are detailed in the accompanying paper: Miranda-Quintana et al. J Cheminform. 2021. 10.1186/s13321-021-00504-4. Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons.
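To make the idea of comparing more than two fingerprints at once concrete, here is a deliberately simplified sketch of an extended Tanimoto-type index: a strict (full-coincidence) variant in which a bit column counts as similar only when all molecules agree. The full framework in the cited paper is more general (it uses tunable coincidence thresholds and weighting), so treat the function below as an assumption-laden illustration, not the published formula set.

```python
def extended_tanimoto(fps):
    # fps: list of equal-length binary fingerprints (n molecules)
    # strict variant: a column is "similar" only when ALL molecules agree;
    # all-zero columns are discarded (as in the pairwise Tanimoto), and any
    # disagreement counts as dissimilarity
    n_sim_on = n_dis = 0
    for column in zip(*fps):
        s = sum(column)
        if s == len(fps):
            n_sim_on += 1          # all bits on: 1-similarity counter
        elif s > 0:
            n_dis += 1             # mixed column: dissimilarity counter
        # s == 0: all-off column, ignored by a Tanimoto-type index
    return n_sim_on / (n_sim_on + n_dis)
```

For n = 2 this reduces to the familiar pairwise Tanimoto c/(a+b-c), matching the reduction property described above, while for n > 2 a single call replaces the n(n-1)/2 pairwise comparisons.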