We present a general approach to forming structure-activity relationships (SARs). This approach is based on representing chemical structure by atoms and their bond connectivities in combination with the inductive logic programming (ILP) algorithm PROGOL. Existing SAR methods describe chemical structure by using attributes which are general properties of an object. It is not possible to map chemical structure directly to attribute-based descriptions, as such descriptions have no internal organization. A more natural and general way to describe chemical structure is to use a relational description, where the internal construction of the description maps that of the object described. Our atom and bond connectivities representation is a relational description. ILP algorithms can form SARs with relational descriptions. We have tested the relational approach by investigating the SARs of 230 aromatic and heteroaromatic nitro compounds. These compounds had been split previously into two subsets, 188 compounds that were amenable to regression and 42 that were not. For the 188 compounds, a SAR was found that was as accurate as the best statistical or neural network-generated SARs. The PROGOL SAR has the advantages that it did not need the use of any indicator variables handcrafted by an expert, and the generated rules were easily comprehensible. For the 42 compounds, PROGOL formed a SAR that was significantly (P < 0.025) more accurate than linear regression, quadratic regression, and back-propagation. This SAR is based on an automatically generated structural alert for mutagenicity.
A structure-activity relationship (SAR) models the relationship between activities and physicochemical properties of a set of compounds and is fundamental to many aspects of chemistry. SAR modeling has been applied to a multitude of biological systems and has aided the development of many new drugs (see refs. 1 and 2). To guide rational drug design a SAR should be both reliable and comprehensible.
This paper presents an approach to forming SARs based on the machine learning program PROGOL (3). This approach allows the use of a rich representation of chemical structure and leads to SARs that are both accurate and simple to understand.
There are two components to deriving a SAR: the choice of representation to describe the chemical structure of the compounds and the learning algorithm employed. The form of learning algorithm restricts the representation that can be employed. Widely used learning algorithms include linear regression (4), partial least-squares regression (PLS) (5), neural networks (6, 7), and decision trees (8). These algorithms have been applied to a variety of descriptions of chemical structure, e.g., Hansch-type parameters (4, 9), topological descriptors (2, 10), quantum mechanical descriptors (9), substructural unit...
A classic problem from chemistry is used to test a conjecture that in domains for which data are most naturally represented by graphs, theories constructed with inductive logic programming (ILP) will significantly outperform those using simpler feature-based methods. One area that has long been associated with graph-based or structural representation and reasoning is organic chemistry. In this field, we consider the problem of predicting the mutagenic activity of small molecules: a property that is related to carcinogenicity, and an important consideration in developing less hazardous drugs. By providing an ILP system with progressively more structural information concerning the molecules, we compare the predictive power of the logical theories constructed against benchmarks set by regression, neural, and tree-based methods.
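The contrast between feature-based and structural (graph-based) descriptions can be illustrated with a minimal Python sketch. It is not taken from any of the papers listed here and is not how an ILP system works internally: the toy molecule, the atom identifiers, and the function names `attribute_description` and `has_nitro_on_carbon` are all assumptions made for illustration. The sketch only shows that two structures with the same flat attributes can differ in a relational property.

```python
from collections import Counter

# Hypothetical toy data: atom facts (id -> element) shared by two structures.
ATOMS = {"a1": "c", "a2": "c", "a3": "n", "a4": "o", "a5": "o"}

# Bond facts (atom, atom, bond type). First structure: a nitro group on carbon a2.
NITRO_BONDS = [
    ("a1", "a2", "aromatic"),
    ("a2", "a3", "single"),
    ("a3", "a4", "double"),
    ("a3", "a5", "single"),
]

# Second structure: same atoms, different connectivity (chain c-c-o-n-o).
OTHER_BONDS = [
    ("a1", "a2", "single"),
    ("a2", "a4", "single"),
    ("a4", "a3", "single"),
    ("a3", "a5", "double"),
]


def attribute_description(atoms):
    """Flat attribute view: element counts only; connectivity is lost."""
    return dict(Counter(atoms.values()))


def neighbours(atom_id, bonds):
    """Return the ids of all atoms bonded to atom_id."""
    found = []
    for first, second, _bond_type in bonds:
        if first == atom_id:
            found.append(second)
        elif second == atom_id:
            found.append(first)
    return found


def has_nitro_on_carbon(atoms, bonds):
    """Relational query: is some nitrogen bonded to two oxygens and a carbon?"""
    for atom_id, element in atoms.items():
        if element != "n":
            continue
        elements = [atoms[n] for n in neighbours(atom_id, bonds)]
        if elements.count("o") >= 2 and "c" in elements:
            return True
    return False
```

Both structures yield the same element counts from `attribute_description`, while `has_nitro_on_carbon` separates them, which is the kind of distinction a relational description can express and a purely attribute-based one cannot.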
Summary: We initiated the Predictive Toxicology Challenge (PTC) to stimulate the development of advanced SAR techniques for predictive toxicology models. The goal of this challenge is to predict the rodent carcinogenicity of new compounds based on the experimental results of the US National Toxicology Program (NTP). Submissions will be evaluated on quantitative and qualitative scales to select the most predictive models and those with the highest toxicological relevance.
Availability: http://www.informatik.uni-freiburg.de/~ml/ptc/
Contact: helma@informatik.uni-freiburg.de
We introduce a novel algorithm for decision tree learning in the multi-instance setting as originally defined by Dietterich et al. It differs from existing multi-instance tree learners in a few crucial, well-motivated details. Experiments on synthetic and real-life datasets confirm the beneficial effect of these differences and show that the resulting system outperforms the existing multi-instance decision tree learners.
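For readers unfamiliar with the multi-instance setting, the following Python sketch illustrates the standard labeling assumption only: a bag of instances is positive if at least one of its instances is positive. It is not the tree learner described above, and the instance concept `in_target_region`, the bag names, and the data are hypothetical.

```python
def bag_label(bag, instance_is_positive):
    """A bag is positive if at least one of its instances is positive."""
    return any(instance_is_positive(instance) for instance in bag)


def in_target_region(instance):
    """Hypothetical instance-level concept: the point lies in a small square."""
    x, y = instance
    return 0.4 <= x <= 0.6 and 0.4 <= y <= 0.6


# Hypothetical bags of 2-D instances; only bag-level labels would be observed.
bags = {
    "b1": [(0.1, 0.9), (0.5, 0.5)],  # contains one instance in the region
    "b2": [(0.0, 0.0), (0.9, 0.1)],  # no instance in the region
    "b3": [],                        # an empty bag has no positive instance
}

labels = {name: bag_label(bag, in_target_region) for name, bag in bags.items()}
```

A multi-instance learner sees only the bag labels and has to recover an instance-level concept that explains them.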
Inductive Logic Programming (ILP) is an area of Machine Learning which has now reached its twentieth year. Using the analogy of a human biography, this paper recalls the development of the subject from its infancy through childhood and teenage years. We show how in each phase ILP has been characterised by an attempt to extend theory and implementations in tandem with the development of novel and challenging real-world applications. Lastly, by projection we suggest directions for research which will help the subject come of age.