Similarity is a subjective and multifaceted concept, regardless of whether compounds or any other objects are considered. Despite its intrinsically subjective nature, attempts to quantify the similarity of compounds have a long history in chemical informatics and drug discovery. Many computational methods employ similarity measures to identify new compounds for pharmaceutical research. However, chemoinformaticians and medicinal chemists typically perceive similarity in different ways. Similarity methods and numerical readouts of similarity calculations are probably among the most misunderstood computational approaches in medicinal chemistry. Herein, we evaluate different similarity concepts, highlight key aspects of molecular similarity analysis, and address some potential misunderstandings. In addition, a number of practical aspects concerning similarity calculations are discussed.
The structural class of a protein domain can be approximately predicted according to its amino acid composition. However, can the prediction quality be improved by taking into account the coupling effect among different amino acid components? This question has evoked much controversy because completely different conclusions have been obtained by different investigators. To resolve such a perplexing problem, predictions by means of various algorithms were performed based on the SCOP database (Murzin et al., 1995), which is more natural and reliable for the study of structural classes because it is based on evolutionary relationships and on the principles that govern the three-dimensional structure of proteins. The results obtained using both resubstitution and jackknife tests indicated that the overall rates of correct prediction by an algorithm incorporating the coupling effect among different amino acid components were significantly higher than those by the algorithms that did not include such an effect. A completely consistent conclusion was also obtained when tests were performed on two large independent testing datasets classified into four and seven structural classes, respectively. An analysis reveals that the opposite conclusion was reached mainly because of (1) misclassifying structural classes according to a conceptually incorrect rule, (2) misapplying the component-coupled algorithm by ignoring some important factors and (3) misrepresenting structural classes with statistically insignificant training subsets. Clarification of these problems would be instructive for effectively using the prediction algorithm and correctly interpreting the results.
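The difference between the resubstitution and jackknife tests mentioned above can be illustrated with a minimal sketch. It is not the component-coupled algorithm from the abstract: it assumes a plain nearest-centroid classifier with Euclidean distance (that is, no coupling between components), and the toy three-component "compositions", class labels, and function names are all hypothetical. A coupling-aware variant would replace the Euclidean distance with a covariance-weighted one.

```python
from collections import defaultdict
from math import dist


def centroids(samples):
    """Mean composition vector per class, from (vector, label) pairs."""
    grouped = defaultdict(list)
    for vec, label in samples:
        grouped[label].append(vec)
    return {
        label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for label, vecs in grouped.items()
    }


def predict(vec, cents):
    """Assign the class with the nearest centroid (Euclidean, no coupling)."""
    return min(cents, key=lambda label: dist(vec, cents[label]))


def resubstitution_rate(samples):
    """Train on all samples, then test on the same samples."""
    cents = centroids(samples)
    return sum(predict(v, y_cents := cents) == y for v, y in samples) / len(samples)


def jackknife_rate(samples):
    """Leave each sample out in turn, train on the rest, test on the left-out one."""
    hits = 0
    for i, (vec, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        hits += predict(vec, centroids(training)) == label
    return hits / len(samples)


# Hypothetical toy data: 3-component "compositions" for two made-up classes.
toy = [
    ((0.60, 0.30, 0.10), "alpha"),
    ((0.50, 0.40, 0.10), "alpha"),
    ((0.55, 0.30, 0.15), "alpha"),
    ((0.20, 0.30, 0.50), "beta"),
    ((0.10, 0.40, 0.50), "beta"),
    ((0.15, 0.35, 0.50), "beta"),
]

print(resubstitution_rate(toy), jackknife_rate(toy))
```

Because the jackknife test never scores a sample with a model that was trained on it, it is generally the more conservative of the two estimates.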
Medicinal chemists are frequently asked to review lists of compounds to assess their drug- or leadlike nature and to evaluate the suitability of lead compounds based on their "attractiveness" and/or synthetic feasibility as a basis for launching a drug-discovery campaign. It is often felt that one medicinal chemist's opinion is as good as any other, but is it? In an attempt to answer this question, an experiment was performed in conjunction with a recent compound acquisition program (CAP) conducted at Pharmacia. Historically, the CAP included a review of many thousands of compounds by medicinal chemists who eliminate anything deemed undesirable for any reason. In a review conducted in 2002, about 22 000 compounds requiring review by medicinal chemists were broken down into 11 lists of approximately 2000 compounds each. Unknown to the medicinal chemists, a subset of 250 compounds, previously rejected by a very experienced senior medicinal chemist, was added to each of the lists. Most of the 13 medicinal chemists who participated in this process reviewed two lists, although some only reviewed a single list and one reviewed three lists. Those compounds that were deemed unacceptable were recorded and tabulated in various ways to assess the consistency of the reviews. It was found that medicinal chemists were not very consistent in the compounds they rejected as being undesirable. The inconsistency arises from the subjective analysis that all humans utilize when considering "data sets" of any kind. This has important implications for pharmaceutical project teams where individual medicinal chemists review lists of primary screening hits to identify those compounds suitable for follow-up. Once a compound is removed from a list, it and other structurally similar compounds are effectively removed from further consideration. This can also have an impact on computational chemists who are developing models for assessing the desirability or attractiveness of different classes of compounds for lead discovery.
Molecular similarity is a pervasive concept in chemistry. It is essential to many aspects of chemical reasoning and analysis and is perhaps the fundamental assumption underlying medicinal chemistry. Dissimilarity, the complement of similarity, also plays a major role in a growing number of applications of molecular diversity in combinatorial chemistry, high-throughput screening, and related fields. How molecular information is represented, called the representation problem, is important to the type of molecular similarity analysis (MSA) that can be carried out in any given situation. In this work, four types of mathematical structure are used to represent molecular information: sets, graphs, vectors, and functions. Molecular similarity is a pairwise relationship that induces structure into sets of molecules, giving rise to the concept of chemical space. Although all three concepts - molecular similarity, molecular representation, and chemical space - are treated in this chapter, the emphasis is on molecular similarity measures. Similarity measures, also called similarity coefficients or indices, are functions that map pairs of compatible molecular representations that are of the same mathematical form into real numbers usually, but not always, lying on the unit interval. This chapter presents a somewhat pedagogical discussion of many types of molecular similarity measures, their strengths and limitations, and their relationship to one another. An expanded account of the material on chemical spaces presented in the first edition of this book is also provided. It includes a discussion of the topography of activity landscapes and the role that activity cliffs in these landscapes play in structure-activity studies.
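As a concrete instance of a similarity measure on set representations, the sketch below computes the Tanimoto (Jaccard) coefficient, which maps a pair of feature sets to a value on the unit interval. The feature sets are hypothetical (for example, indices of fingerprint bits that are set), and treating two empty sets as identical is an assumed convention, not something stated in the abstract.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets; lies in [0, 1]."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical feature sets for two molecules.
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 8, 9, 12}

similarity = tanimoto(mol_a, mol_b)   # 3 shared features out of 6 distinct
dissimilarity = 1.0 - similarity      # the complement, as used in diversity work

print(similarity, dissimilarity)
```

The same function works for any pair of representations of the same form, which is the compatibility requirement the abstract describes.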
We report consensus Structure-Activity Similarity (SAS) maps that address the dependence of activity landscapes on molecular representation. As a case study, we characterized the activity landscape of 54 compounds with activities against human cathepsin B (hCatB), human cathepsin L (hCatL), and Trypanosoma brucei cathepsin B (TbCatB). Starting from an initial set of 28 descriptors we selected ten representations that capture different aspects of the chemical structures. These included four 2D (MACCS keys, GpiDAPH3, pairwise, and radial fingerprints) and six 3D (4p and piDAPH4 fingerprints with each including three conformers) representations. Multiple conformers are used for the first time in consensus activity landscape modeling. The results emphasize the feasibility of identifying consensus data points that are consistently formed in different reference spaces generated with several fingerprint models, including multiple 3D conformers. Consensus data points are not meant to eliminate data, disregarding, for example, "true" activity cliffs that are not identified by some molecular representations. Instead, consensus models are designed to prioritize the SAR analysis of activity cliffs and other consistent regions in the activity landscape that are captured by several molecular representations. Systematic description of the SARs of two targets gives rise to the identification of pairs of compounds located in the same region of the activity landscape of hCatL and TbCatB, suggesting similar mechanisms of action for the pairs involved. We also explored the relationship between property similarity and activity similarity and found that property similarities are suitable to characterize SARs. We also introduce the concept of structure-property-activity (SPA) similarity in SAR studies.
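The idea of a consensus data point can be sketched as follows. The abstract does not give thresholds, region names, or a consensus rule, so everything here is an assumption for illustration: the 0.5 cutoffs, the four region labels, the range-scaled activity similarity, and the rule that a pair counts as consensus when at least a given fraction of representations place it in the same region.

```python
def activity_similarity(p_i: float, p_j: float, p_range: float) -> float:
    """Assumed definition: 1 minus the potency difference scaled by the data range."""
    return 1.0 - abs(p_i - p_j) / p_range


def region(struct_sim: float, act_sim: float,
           struct_cut: float = 0.5, act_cut: float = 0.5) -> str:
    """Place one compound pair in a quadrant of a SAS map (hypothetical cutoffs)."""
    if struct_sim >= struct_cut:
        return "smooth SAR" if act_sim >= act_cut else "activity cliff"
    return "scaffold hop" if act_sim >= act_cut else "nondescript"


def consensus_region(struct_sims: dict, act_sim: float, min_fraction: float = 0.8):
    """Return the region shared by at least min_fraction of representations, else None."""
    labels = [region(s, act_sim) for s in struct_sims.values()]
    top = max(set(labels), key=labels.count)
    return top if labels.count(top) / len(labels) >= min_fraction else None


# Hypothetical pair: structure similarity under three fingerprints,
# potencies of 8.0 and 5.0 on a log scale, data set potency range of 4.0.
sims = {"MACCS": 0.9, "radial": 0.7, "pairwise": 0.8}
act = activity_similarity(8.0, 5.0, 4.0)

print(consensus_region(sims, act))
```

A pair that is structurally similar under every representation but differs strongly in potency is flagged in all of them, which is the kind of consistently identified point the consensus approach prioritizes; a pair on which the representations disagree returns no consensus region and is simply not prioritized, rather than discarded.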