The quantification of chemical diversity has many applications in drug discovery, organic chemistry, food, and natural product chemistry, to name a few. As the size of the chemical space is expanding rapidly, it is imperative to develop efficient methods to quantify the diversity of large and ultralarge chemical libraries and visualize their mutual relationships in chemical space. Herein, we show an application of our recently introduced extended similarity indices to measure the fingerprint-based diversity of 19 chemical libraries typically used in drug discovery and natural products research with over 18 million compounds. Based on this concept, we introduce the Chemical Library Networks (CLNs) as a general and efficient framework to represent visually the chemical space of large chemical libraries providing a global perspective of the relation between the libraries. For the 19 compound libraries explored in this work, it was found that the (extended) Tanimoto index offers the best description of extended similarity in combination with RDKit fingerprints. CLNs are general and can be explored with any structure representation and similarity coefficient for large chemical libraries.
Extended (or n-ary) similarity indices have been recently proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow comparing any number of objects at the same time. This results in a remarkable efficiency gain with respect to other approaches, since now we can compare N molecules in O(N) instead of the common quadratic O(N²) timescale. This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present the further generalization of this formalism so it can be applied to numerical data, i.e. to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and prove that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.
Recently, we have introduced several methodological frameworks to extend the usage of similarity measures beyond the common cases mentioned above. Most importantly, we have demonstrated that the mathematical expansion of the core concepts of similarity measures can provide a way to quantify the similarity of an arbitrary number of objects at the same time. We first showed this on binary (molecular) fingerprints: the resulting similarity measures were termed extended (or n-ary) similarity measures [15].
They employ the core concept of similarity and dissimilarity counters, which have replaced the a, b, c and d terms that are commonly applied in the well-known, pairwise definitions of the similarity measures to describe the number of bit positions where two fingerprints have co-occurring one (a) or zero (d) bits, or a one bit that is exclusive to either of the fingerprints (b and c). In our framework, the 1-similarity, 0-similarity, and dissimilarity counters express the number of bit positions where the number of co-occurring one (or zero) bits is above, or below, a predefined coincidence threshold, respectively. For pairwise comparisons, these generalizations naturally revert to the well-known definitions of the classical, pairwise similarity measures.
We have shown that the new methodology is not only computationally efficient, scaling as O(n) with the number of compared objects n, but it can be successfully applied for tasks such as diversity selection, clustering, as well as the visualization of large sections of chemical space [16][17][18][19]. A further generalization involved the extension of this framework to allow for more than two possible characters (t = 2) in an object (vector), opening the possibility to apply the extended similarity measures in bioinformatics, for the comparison of nucleotide (t = 4) or prot...
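The counter idea described above can be illustrated with a minimal Python sketch. It is a simplified, unweighted reading of the text, not the published implementation: the function names `extended_counters` and `extended_tanimoto` are hypothetical, the default coincidence threshold of `n % 2` is an assumption, and any weighting functions used in the cited papers are left out. The point it shows is that the counters depend only on the column sums of the fingerprint matrix, which is what makes a single O(n) pass over the n fingerprints sufficient.

```python
import numpy as np


def extended_counters(fps, threshold=None):
    """Count 1-similarity, 0-similarity and dissimilarity bit positions.

    fps: (n_objects, n_bits) array-like of 0/1 values.
    threshold: coincidence threshold; defaults to n_objects % 2
               (an assumption made for this sketch).
    """
    fps = np.asarray(fps, dtype=int)
    n_objects, n_bits = fps.shape
    if threshold is None:
        threshold = n_objects % 2
    # One pass over the fingerprints: number of 'on' bits per position.
    col_sums = fps.sum(axis=0)
    # Excess of ones over zeros at each position (negative if zeros dominate).
    delta = 2 * col_sums - n_objects
    n_sim1 = int(np.sum(delta > threshold))    # ones coincide above threshold
    n_sim0 = int(np.sum(-delta > threshold))   # zeros coincide above threshold
    n_dis = n_bits - n_sim1 - n_sim0           # everything else
    return n_sim1, n_sim0, n_dis


def extended_tanimoto(fps, threshold=None):
    """Unweighted Tanimoto-style index built from the counters."""
    n_sim1, _, n_dis = extended_counters(fps, threshold)
    denom = n_sim1 + n_dis
    return n_sim1 / denom if denom else 0.0


pair = [[1, 1, 0, 0],
        [1, 0, 0, 1]]
triple = pair + [[1, 1, 0, 1]]
print(extended_counters(pair), extended_tanimoto(pair))
print(extended_counters(triple), extended_tanimoto(triple))
```

For the two-fingerprint case the counters reduce to a, d and b + c, so the value returned is the classical pairwise Tanimoto coefficient a / (a + b + c), consistent with the statement that the generalization reverts to the pairwise definitions.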
A combination of in situ X-ray photoelectron spectroscopy and mass spectrometry has been used to elucidate the elementary surface reactions initiated by the interaction of low-energy (860 eV) argon ions with three organometallic precursors [Ru(CO)4I2, Co(CO)3NO, and WN(NMe2)3]. The effects of ion exposure on each precursor can be described by a largely sequential series of surface reactions. The initial step involves ion-induced decomposition of the precursor to create a nonvolatile deposit, followed by physical sputtering of the atoms in the deposit. For the precursors that contain CO ligands [Ru(CO)4I2 and Co(CO)3NO], ion-induced decomposition is accompanied by desorption of the majority of the CO groups. This is in marked contrast to previous studies of low-energy electron-induced reactions with the same precursors where precursor decomposition yielded only partial desorption of the CO ligands. Conversely, argon ion bombardment of WN(NMe2)3 led to decomposition without ligand loss. For all three precursors, the initial ion-induced decomposition step was not accompanied by significant desorption of intact precursor molecules, while during subsequent physical sputtering of the deposited atoms, ligand-derived organic and inorganic contaminants were removed at higher rates than the metals. This indicates that controlled ion beam deposition conditions could be used to produce deposits with high metal contents from all three precursors. Comparison of low-energy electron-induced reactions of these three precursors with results of this investigation indicates that secondary electrons do not play an important role in the deposition process, but rather precursor decomposition occurs via efficient ion–molecule energy transfer. These reactions are discussed in the context of focused ion beam-induced deposition.
Understanding structure-activity landscapes is essential in drug discovery. Similarly, it has been shown that the presence of activity cliffs in compound data sets can have a substantial impact not only on the design progress but also can influence the predictive ability of machine learning models. With the continued expansion of the chemical space and the currently available large and ultra-large libraries, it is imperative to implement efficient tools to analyze the activity landscape of compound data sets rapidly. The goal of this study is to show the applicability of the n-ary indices to quantify the structure-activity landscapes of large compound data sets using different types of structural representation rapidly and efficiently. We also discuss how a recently introduced medoid algorithm provides the foundation to finding optimum correlations between similarity measures and structure-activity rankings. The applicability of the n-ary indices and the medoid algorithm is shown by analyzing the activity landscape of 10 compound data sets with pharmaceutical relevance using three fingerprints of different designs, 16 extended similarity indices, and 11 coincidence thresholds.