An increasing number of publications report genotoxicity studies of metal oxide and silica nanomaterials, which may induce different kinds of genotoxicity via a variety of mechanisms.
Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For "black box" models, such as random forest, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows for the determination of the influence of each variable on the model prediction for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined and their typical contribution towards predictions made for individual classes, i.e., class-specific feature contribution "patterns", are discovered. These patterns represent a standard behaviour of the model and allow for an additional assessment of the model reliability for new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of the results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
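The decomposition described above can be illustrated with a minimal sketch for a single decision tree: the prediction is expressed as a bias term (the root's training-set estimate) plus one contribution per feature encountered along the decision path. The toy tree, its probabilities, and all names below are illustrative assumptions, not the paper's implementation; for a forest, the per-tree contributions would be averaged.

```python
# Each internal node stores: split feature, threshold, the class-1 probability
# estimated from training instances reaching it, and its two children.
tree = {
    "feature": "x0", "threshold": 0.5, "prob": 0.50,
    "left": {
        "feature": "x1", "threshold": 2.0, "prob": 0.30,
        "left":  {"prob": 0.10},   # leaf
        "right": {"prob": 0.60},   # leaf
    },
    "right": {"prob": 0.80},       # leaf
}

def feature_contributions(node, instance):
    """Walk the decision path for one instance; attribute each change in
    the class-1 probability estimate to the feature used at that split."""
    bias = node["prob"]                       # root (training-set) estimate
    contributions = {}
    while "feature" in node:                  # descend until a leaf is reached
        feat = node["feature"]
        child = node["left"] if instance[feat] <= node["threshold"] else node["right"]
        contributions[feat] = contributions.get(feat, 0.0) + child["prob"] - node["prob"]
        node = child
    # By construction: prediction == bias + sum of contributions
    return bias, contributions, node["prob"]

bias, contribs, pred = feature_contributions(tree, {"x0": 0.2, "x1": 3.1})
print(bias, contribs, pred)
```

The invariant that the prediction equals the bias plus the summed contributions is what makes the per-feature values directly comparable across instances and classes.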
Nanotechnology is of increasing significance. Curation of nanomaterial data into electronic databases offers opportunities to better understand and predict nanomaterials’ behaviour. This supports innovation in, and regulation of, nanotechnology. It is commonly understood that curated data need to be sufficiently complete and of sufficient quality to serve their intended purpose. However, assessing data completeness and quality is non-trivial in general and is arguably especially difficult in the nanoscience area, given its highly multidisciplinary nature. The current article, part of the Nanomaterial Data Curation Initiative series, addresses how to assess the completeness and quality of (curated) nanomaterial data. In order to address this key challenge, a variety of related issues are discussed: the meaning and importance of data completeness and quality, existing approaches to their assessment and the key challenges associated with evaluating the completeness and quality of curated nanomaterial data. Considerations specific to the nanoscience area are examined, along with lessons that can be learned from other relevant scientific disciplines. Hence, the scope of this discussion ranges from physicochemical characterisation requirements for nanomaterials and interference of nanomaterials with nanotoxicology assays to broader issues such as minimum information checklists, toxicology data quality schemes and computational approaches that facilitate evaluation of the completeness and quality of (curated) data. This discussion is informed by a literature review and a survey of key nanomaterial data curation stakeholders. Finally, drawing upon this discussion, recommendations are presented concerning the central question: how should the completeness and quality of curated nanomaterial data be evaluated?
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models.
One of the initial steps of modern drug discovery is the identification of small organic molecules able to inhibit a target macromolecule of therapeutic interest. A small proportion of these hits are further developed into lead compounds, which in turn may ultimately lead to a marketed drug. A screening protocol commonly used for this task is high-throughput screening (HTS). However, the performance of HTS against antibacterial targets has generally been unsatisfactory, with high costs and low rates of hit identification. Here, we present a novel computational methodology that is able to identify a high proportion of structurally diverse inhibitors by searching unusually large molecular databases in a time-, cost- and resource-efficient manner. This virtual screening methodology was tested prospectively on two versions of an antibacterial target (type II dehydroquinase from Mycobacterium tuberculosis and Streptomyces coelicolor), for which HTS has not provided satisfactory results and consequently practically all known inhibitors are derivatives of the same core scaffold. Overall, our protocols identified 100 new inhibitors, with calculated Ki values ranging from 4 to 250 μM (confirmed hit rates are 60% and 62% against each version of the target). Most importantly, over 50 new active molecular scaffolds were discovered that underscore the benefits that a wide application of prospectively validated in silico screening tools is likely to bring to antibacterial hit identification.
This is a repository copy of Molecular fingerprint-derived similarity measures for toxicological read-across: Recommendations for optimal use.
In recent years, considerable effort has been invested in the development of classification models for prospective hERG inhibitors, due to the implications of hERG blockade for cardiotoxicity and the low throughput of functional hERG assays. We present novel approaches for binary classification which seek to separate strong inhibitors (IC50 <1 µM) from 'non-blockers' exhibiting moderate (1-10 µM) or weak (IC50 ≥10 µM) inhibition, as required by the pharmaceutical industry. Our approaches are based on (discretized) 2D descriptors, selected using Winnow, with additional models generated using Random Forest (RF) and Support Vector Machines (SVMs). We compare our models to those previously developed by Thai and Ecker and by Dubus et al. The purpose of this paper is twofold: 1. To propose that our approaches (with Matthews Correlation Coefficients ranging from 0.40 to 0.87 on truly external test sets, when extrapolation beyond the applicability domain was not evident and sufficient quantities of data were available for training) are competitive with those currently proposed in the literature. 2. To highlight key issues associated with building and assessing truly predictive models, in particular the considerable variation in model performance when training and testing on different datasets.
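The Matthews Correlation Coefficient used above to score the binary hERG classifiers can be computed directly from the confusion matrix. The sketch below is a small illustration (not the paper's code); the confusion-matrix counts are made-up values, with label 1 standing for "strong inhibitor" and 0 for "non-blocker".

```python
import math

def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is empty (the conventional choice)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical test-set counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
print(round(mcc(40, 45, 5, 10), 3))
```

Unlike plain accuracy, MCC stays informative on the imbalanced class distributions typical of strong-inhibitor datasets, which is why it is the headline metric here.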