Computers in chemistry V 0380 Active Learning with Support Vector Machines in the Drug Discovery Process. -(WARMUTH*, M. K.; LIAO, J.; RAETSCH, G.; MATHIESON, M.; PUTTA, S.; LEMMEN, C.; J. Chem. Inf. Comput. Sci. 43 (2003) 2, 667-673; Comp. Sci. Dep., Univ. Calif., Santa Cruz, CA 95064, USA; Eng.) -Lindner 22-232
We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by "Support Vector Machines". This hyperplane separates the current set of active compounds from the inactive ones and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.
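As a rough illustration of batch selection driven by a separating hyperplane, the sketch below ranks untested compounds by their distance to a given hyperplane and picks the closest ones for the next round of screening. This is only one possible margin-based rule and is not claimed to be the exact strategy of the paper; the names `distance_to_hyperplane` and `select_batch`, the two-dimensional descriptors, and the hyperplane parameters `w` and `b` are all hypothetical, and the SVM training step that would produce `w` and `b` is assumed to have happened elsewhere.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Unsigned geometric distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def select_batch(unlabeled, w, b, batch_size):
    """Return indices of the batch_size unlabeled compounds closest to the hyperplane."""
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: distance_to_hyperplane(unlabeled[i], w, b))
    return ranked[:batch_size]

# Hypothetical 2-D descriptors for five untested compounds.
pool = [(2.0, 2.0), (0.1, 0.0), (-3.0, 1.0), (0.5, -0.2), (-0.2, 0.4)]

# Assumed hyperplane, standing in for the output of a previously trained SVM.
w, b = (1.0, 1.0), 0.0

# Compounds the current model is least certain about go into the next batch.
batch = select_batch(pool, w, b, batch_size=2)
print(batch)
```

In a full loop, the selected compounds would be tested, their labels added to the training set, and the hyperplane refit before the next batch is chosen.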
Background: Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput physical or computational tests that can accurately predict protein activity under conditions relevant to its final application. Here we describe a new synthetic biology approach to protein engineering that avoids these limitations by combining high throughput gene synthesis with machine learning-based design algorithms.
We consider boosting algorithms that maintain a distribution over a set of examples. At each iteration a weak hypothesis is received and the distribution is updated. We motivate these updates as minimizing the relative entropy subject to linear constraints. For example, AdaBoost constrains the edge of the last hypothesis w.r.t. the updated distribution to be at most γ = 0. In some sense, AdaBoost is "corrective" w.r.t. the last hypothesis. A cleaner boosting method is to be "totally corrective": the edges of all past hypotheses are constrained to be at most γ, where γ is suitably adapted. Using new techniques, we prove the same iteration bounds for the totally corrective algorithms as for their corrective versions. Moreover, with adaptive γ, the algorithms provably maximize the margin. Experimentally, the totally corrective versions return smaller convex combinations of weak hypotheses than the corrective ones and are competitive with LPBoost, a totally corrective boosting algorithm with no regularization, for which there is no iteration bound known.
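The "corrective" property can be checked numerically. The sketch below assumes a weak hypothesis with values in {-1, +1} and uses the standard AdaBoost reweighting; the helper names `edge` and `adaboost_update` and the toy data are hypothetical. After the update, the edge of the last hypothesis with respect to the new distribution is zero, which is the constraint described above with γ = 0.

```python
import math

def edge(dist, margins):
    """Edge of a hypothesis: sum_i d_i * y_i * h(x_i), where margins[i] = y_i * h(x_i)."""
    return sum(d * u for d, u in zip(dist, margins))

def adaboost_update(dist, margins):
    """Corrective update for a +/-1-valued hypothesis whose edge lies strictly in (-1, 1)."""
    g = edge(dist, margins)
    alpha = 0.5 * math.log((1.0 + g) / (1.0 - g))
    # Down-weight correctly classified examples, up-weight mistakes, then renormalize.
    unnorm = [d * math.exp(-alpha * u) for d, u in zip(dist, margins)]
    z = sum(unnorm)
    return [v / z for v in unnorm]

# Toy data: six examples, the hypothesis is wrong on two of them.
margins = [1, 1, 1, -1, 1, -1]
dist = [1.0 / 6.0] * 6

old_edge = edge(dist, margins)          # 1/3 under the uniform distribution
new_dist = adaboost_update(dist, margins)
new_edge = edge(new_dist, margins)      # numerically zero after the update
print(old_edge, new_edge)
```

A totally corrective algorithm would instead choose the new distribution so that the edges of all past hypotheses, not only the last one, satisfy the constraint; that requires solving a small optimization problem per iteration and is not shown here.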
Summary
What is known and objective
Drug‐drug interactions (DDI) are frequent causes of adverse clinical drug reactions. Efforts have been directed at accurately identifying DDI at an early stage for drug safety assessments, including the development of in silico predictive methods. In particular, similarity‐based in silico methods have been developed to assess DDI with good accuracy, and machine learning methods have been employed to further extend the predictive range of similarity‐based approaches. However, the performance of a previously developed machine learning method is lower than expected, partly because of the use of less diverse DDI training data sets and a less optimal set of similarity measures.
Method
In this work, we developed a machine learning model using support vector machines (SVMs) based on the literature‐reported established set of similarity measures and comprehensive training data sets. The established similarity measures include the 2D molecular structure similarity, 3D pharmacophoric similarity, interaction profile fingerprint (IPF) similarity, target similarity and adverse drug effect (ADE) similarity, which were extracted from well‐known databases, such as DrugBank and Side Effect Resource (SIDER). A pairwise kernel was constructed for the known and possible drug pairs based on the five established similarity measures and then used as the input vector of the SVM.
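To make the idea of a kernel over drug pairs concrete, the sketch below shows one common construction, the symmetric tensor-product pairwise kernel, built on top of averaged drug-level similarities. Whether this matches the construction used in the study is not stated in the abstract, so it should be read as an assumption; the function names, the averaging of measures, and the two toy similarity matrices (standing in for the five measures) are all hypothetical.

```python
def drug_similarity(sims, a, b):
    """Average the per-measure similarities of drugs a and b (an assumed combination rule)."""
    return sum(s[a][b] for s in sims) / len(sims)

def pairwise_kernel(sims, pair1, pair2):
    """Symmetric tensor-product kernel between two unordered drug pairs."""
    (a, b), (c, d) = pair1, pair2
    return (drug_similarity(sims, a, c) * drug_similarity(sims, b, d)
            + drug_similarity(sims, a, d) * drug_similarity(sims, b, c))

# Toy similarity matrices over three drugs (indices 0, 1, 2); values are made up.
sim_structure = [[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.4],
                 [0.2, 0.4, 1.0]]
sim_target = [[1.0, 0.3, 0.6],
              [0.3, 1.0, 0.8],
              [0.6, 0.8, 1.0]]
sims = [sim_structure, sim_target]

# Kernel value between drug pair (0, 1) and drug pair (0, 2).
k = pairwise_kernel(sims, (0, 1), (0, 2))

# Swapping the order of drugs inside a pair leaves the value unchanged.
k_swapped = pairwise_kernel(sims, (1, 0), (0, 2))
print(k, k_swapped)
```

A matrix of such values over all training pairs could then be supplied to an SVM implementation that accepts a precomputed kernel.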
Result
The 10‐fold cross‐validation studies showed a predictive performance of AUROC >0.97, which is significantly improved compared with the AUROC of 0.67 of an analogously developed machine learning model. Our study suggested that a similarity‐based SVM prediction is highly useful for identifying DDI.
Conclusion
In silico methods based on multifarious drug similarities have been suggested to be feasible for DDI prediction in various studies. Our pairwise kernel SVM model achieved better accuracy than some previous works and can be used as a pharmacovigilance tool to detect potential DDI.