This article describes RegioSelectivity-Predictor (RS-Predictor), a new in silico method for generating predictive models of P450-mediated metabolism for drug-like compounds. Within this method, potential sites of metabolism (SOMs) are represented as “metabolophores”: a concept that describes the hierarchical combination of topological and quantum chemical descriptors needed to represent the reactivity of potential metabolic reaction sites. RS-Predictor modeling involves the use of metabolophore descriptors together with multiple-instance ranking (MIRank) to generate an optimized descriptor weight vector that encodes regioselectivity trends across all cases in a training set. The resulting pathway-independent, isozyme-specific regioselectivity model may be used to predict potential metabolic liabilities. In the present work, cross-validated RS-Predictor models were generated for a set of 394 substrates of CYP 3A4 as a proof-of-principle for the method. Rank aggregation was then employed to merge independently generated predictions for each substrate into a single consensus prediction. The resulting consensus RS-Predictor models were shown to reliably identify at least one observed site of metabolism in the top two rank-positions on 78% of the substrates. Comparisons between RS-Predictor and previously described regioselectivity prediction methods reveal new insights into how in silico metabolite prediction methods should be compared.
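The consensus step and the top-two criterion mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which rank aggregation scheme was used, so mean-rank aggregation is an assumption here, and the function names and atom labels are hypothetical. Each ranking is assumed to list every candidate atom of the substrate.

```python
from collections import defaultdict


def aggregate_mean_rank(rankings):
    """Merge several rankings of the same atoms into one consensus ranking.

    Each ranking lists atom ids from most to least likely SOM.
    Mean-rank aggregation is an assumed scheme, not the published one.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, atom in enumerate(ranking, start=1):
            totals[atom] += position
    # Lower mean rank is better; ties are broken by atom id for determinism.
    return sorted(totals, key=lambda atom: (totals[atom] / len(rankings), atom))


def top_k_hit(ranking, observed_soms, k=2):
    """True if at least one observed SOM appears in the first k positions."""
    return any(atom in observed_soms for atom in ranking[:k])


def top_k_rate(predictions, k=2):
    """Fraction of substrates with a top-k hit.

    predictions is a list of (ranking, observed_soms) pairs, one per substrate.
    """
    hits = sum(top_k_hit(ranking, soms, k) for ranking, soms in predictions)
    return hits / len(predictions)


if __name__ == "__main__":
    # Three hypothetical models ranking the atoms of one substrate.
    model_rankings = [["C1", "C2", "N3"], ["C2", "C1", "N3"], ["C2", "N3", "C1"]]
    consensus = aggregate_mean_rank(model_rankings)
    print(consensus, top_k_hit(consensus, {"C1"}))
```

Under this criterion a substrate counts as correctly predicted when any one of its observed sites is ranked first or second, which is the sense in which the 78% figure is reported.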
RS-Predictor is a tool for creating pathway-independent, isozyme-specific site of metabolism (SOM) prediction models using any set of known cytochrome P450 substrates and metabolites. Until now, the RS-Predictor method was only trained and validated on CYP 3A4 data, but in the present study we report on the versatility of the RS-Predictor modeling paradigm by creating and testing regioselectivity models for substrates of the nine most important CYP isozymes. Through curation of source literature, we have assembled 680 substrates distributed among CYPs 1A2, 2A6, 2B6, 2C19, 2C8, 2C9, 2D6, 2E1 and 3A4, which we believe is the largest publicly accessible collection of P450 ligands and metabolites ever released. A comprehensive investigation into the importance of different descriptor classes for predicting the regioselectivity of each isozyme is made through the generation of multiple independent RS-Predictor models for each set of isozyme substrates. Two of these models include a DFT reactivity descriptor derived from SMARTCyp. Optimal combinations of RS-Predictor and SMARTCyp are shown to have stronger performance than either method alone, while also exceeding the accuracy of the commercial regioselectivity prediction methods distributed by StarDrop and Schrödinger, correctly identifying a large proportion of the metabolites in each substrate set within the top two rank-positions: 1A2 (83.0%), 2A6 (85.7%), 2B6 (82.1%), 2C19 (86.2%), 2C8 (83.8%), 2C9 (84.5%), 2D6 (85.9%), 2E1 (82.8%), 3A4 (82.3%) and merged (86.0%). Comprehensive datamining of each substrate set and careful statistical analyses of the predictions made by the different models revealed new insights into molecular features that control metabolic regioselectivity and enable accurate prospective prediction of likely SOMs.
A single linear program is proposed for discriminating between the elements of k disjoint point sets in the n-dimensional real space R^n. When the conical hulls of the k sets are (k-1)-point disjoint in R^{n+1}, a k-piece piecewise-linear surface generated by the linear program completely separates the k sets. This improves on a previous linear programming approach which required that each set be linearly separable from the remaining k-1 sets. When the conical hulls of the k sets are not (k-1)-point disjoint, the proposed linear program generates an error-minimizing piecewise-linear separator for the k sets. For this case it is shown that the null solution is never a unique solver of the linear program and occurs only under the rather rare condition when the mean of each point set equals the mean of the means of the other k-1 sets. This makes the proposed linear programming formulation computationally useful for approximately discriminating between k sets that are not piecewise-linear separable. Computational results are reported for three previously available databases.
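To make the k-piece piecewise-linear separator concrete, the sketch below evaluates one for fixed parameters; it does not solve the linear program. It assumes each class i has a weight vector w_i and threshold gamma_i, that a point is assigned to the class with the largest score w_i . x - gamma_i, and that the error being minimized is the per-class average of unit-margin violations. That error form is stated as an assumption, since the abstract gives no formula, and all names are hypothetical.

```python
def score(x, w, gamma):
    """Affine score w . x - gamma of point x for one class."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) - gamma


def classify(x, weights, thresholds):
    """Assign x to the class whose affine score is largest."""
    scores = [score(x, w, g) for w, g in zip(weights, thresholds)]
    return max(range(len(scores)), key=scores.__getitem__)


def separation_error(point_sets, weights, thresholds):
    """Sum over ordered class pairs (i, j) of the mean unit-margin violation.

    A point of class i violates the margin against class j when its score
    for i does not exceed its score for j by at least 1 (assumed error form).
    """
    total = 0.0
    for i, points in enumerate(point_sets):
        for j in range(len(point_sets)):
            if i == j:
                continue
            violation = 0.0
            for x in points:
                margin = score(x, weights[i], thresholds[i]) - score(
                    x, weights[j], thresholds[j]
                )
                violation += max(0.0, 1.0 - margin)
            total += violation / len(points)
    return total


if __name__ == "__main__":
    # Three one-dimensional point sets that no single hyperplane can split
    # pairwise from the rest, but a 3-piece piecewise-linear surface can.
    sets = [[[0.0], [1.0]], [[3.0], [4.0]], [[6.0], [7.0]]]
    w = [[-2.0], [0.0], [2.0]]
    gamma = [-5.0, 0.0, 9.0]
    print(separation_error(sets, w, gamma))
    print(separation_error(sets, [[0.0]] * 3, [0.0] * 3))
```

With all weights and thresholds set to zero, every margin is zero and this error equals k(k-1), which gives a baseline against which the abstract's remark about the null solution can be read.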
We develop metrics for measuring the quality of synthetic health data for both education and research. We use novel and existing metrics to capture a synthetic dataset's resemblance, privacy, utility and footprint. Using these metrics, we develop an end-to-end workflow based on our generative adversarial network (GAN) method, HealthGAN, that creates privacy-preserving synthetic health data. Our workflow meets the privacy specifications of our data partner: (1) the HealthGAN is trained inside a secure environment; (2) the HealthGAN model is used outside of the secure environment by external users to generate synthetic data. This second step facilitates data handling for external users by avoiding de-identification, which may require special user training, be costly, or cause loss of data fidelity. This workflow is compared against five other baseline methods. While maintaining resemblance and utility comparable to other methods, HealthGAN provides the best privacy and footprint. We present two case studies in which our methodology was put to work in classroom and research settings. We evaluate utility in the classroom through a data analysis challenge given to students and in research by replicating three different medical papers with synthetic data. Data, code, and the challenge that we organized for educational purposes are available.
We present a novel approach for analysis of Mycobacterium tuberculosis complex (MTC) strain genotyping data. Our work presents a first step in an ongoing project dedicated to the development of decision support tools for tuberculosis (TB) epidemiologists exploiting both genotyping and epidemiological data. We focus on spacer oligonucleotide typing (spoligotyping), a genotyping method based on analysis of a direct repeat (DR) locus. We use mixture models to identify strain families of MTC based on their spoligotyping patterns. Our algorithm, SPOTCLUST, incorporates biological information on spoligotype evolution, without attempting to derive the full phylogeny of MTC. We applied our algorithm to 535 different spoligotype patterns identified among 7166 MTC strains isolated between 1996 and 2004 from New York State TB patients. Two models were employed and validated: a 36-component model based on global spoligotype database SpolDB3, and a randomly initialized model (RIM) containing 48 components. Our analysis both confirmed previously expert-defined families of MTC strains and suggested certain new families. SPOTCLUST, which is available online, can be further improved by incorporating data obtained using additional strain genetic markers and epidemiological information. We demonstrate on New York City (NYC) patient data how the resulting models can potentially form the basis of TB control tools using genotyping.
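As an illustration of how a mixture model assigns a binary spacer pattern to a strain family, the sketch below computes component posteriors under a plain Bernoulli mixture. This is a simplification and an assumption: SPOTCLUST itself incorporates information on spoligotype evolution, which this sketch does not model, and the parameter values shown are invented for the example.

```python
import math


def component_posteriors(pattern, priors, presence_probs):
    """Posterior probability of each mixture component for one binary pattern.

    pattern: sequence of 0/1 spacer indicators.
    priors: mixing weight of each component.
    presence_probs: per component, the probability that each spacer is present;
        values must lie strictly between 0 and 1.
    """
    log_joint = []
    for prior, probs in zip(priors, presence_probs):
        log_p = math.log(prior)
        for bit, p in zip(pattern, probs):
            log_p += math.log(p if bit else 1.0 - p)
        log_joint.append(log_p)
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint)
    weights = [math.exp(value - peak) for value in log_joint]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    # Two hypothetical families over four spacers (real patterns are longer).
    priors = [0.5, 0.5]
    presence = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]]
    print(component_posteriors([1, 1, 0, 0], priors, presence))
```

A pattern would then be assigned to the family with the highest posterior, and a fitted model would estimate the priors and presence probabilities from data rather than fixing them by hand.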