This paper evaluates the effectiveness of various similarity coefficients for 2D similarity searching when multiple bioactive target structures are available. Similarity searches using several different activity classes within the MDL Drug Data Report and the Dictionary of Natural Products databases are performed using BCI 2D fingerprints. Using data fusion techniques to combine the resulting nearest-neighbor lists, we obtain group recall results which, in many cases, are a considerable improvement on the standard average recall values obtained for individual structures. It is shown that the degree of improvement can be related to the structural diversity of the activity class that is searched for, the best results being found for the most diverse groups. The group recall of active compounds using subsets of the class is also investigated: for highly self-similar activity classes, the group recall improvement saturates well before the full activity class size is reached. A rough correlation is found between the relative improvement using the group recall and the square of the number of unique compounds available in all of the merged lists. The Tanimoto coefficient is found unambiguously to be the best coefficient to use for the recovery of active compounds using multiple targets. Furthermore, when using the Tanimoto coefficient, the "MAX" fusion rule is found to be more effective than the "SUM" rule for the combination of similarity searches from multiple targets. The use of group recall can lead to improved enrichment in database searches and virtual screening.
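The fusion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: fingerprints are represented here as plain sets of "on" bit positions rather than BCI 2D fingerprints, the function names (`tanimoto`, `fuse_scores`, `top_k`, `recall`) are hypothetical, and applying the "SUM" rule to raw similarity scores rather than to ranks is an assumption made for brevity.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Assumption: a fingerprint is modelled as the set of its "on" bit positions.
Fingerprint = FrozenSet[int]


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto coefficient: shared bits / (bits in a + bits in b - shared bits)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def fuse_scores(targets: List[Fingerprint],
                database: Dict[str, Fingerprint],
                rule: str = "MAX") -> Dict[str, float]:
    """Combine one similarity search per target into a single score per compound."""
    if rule not in ("MAX", "SUM"):
        raise ValueError("rule must be 'MAX' or 'SUM'")
    fused = {}
    for name, fingerprint in database.items():
        scores = [tanimoto(target, fingerprint) for target in targets]
        fused[name] = max(scores) if rule == "MAX" else sum(scores)
    return fused


def top_k(fused: Dict[str, float], k: int) -> List[str]:
    """Names of the k highest-scoring compounds (ties broken by name)."""
    return sorted(fused, key=lambda name: (-fused[name], name))[:k]


def recall(retrieved: Iterable[str], actives: Set[str]) -> float:
    """Fraction of the known actives that appear in the retrieved list."""
    return len(set(retrieved) & actives) / len(actives)
```

With "MAX", a database compound is ranked by its closest target, so an active that resembles only one member of a diverse activity class can still reach the top of the fused list; with "SUM", compounds that are moderately similar to many targets are favored instead.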
Previous studies of the analysis of molecular matched pairs (MMPs) have often assumed that the effect of a substructural transformation on a molecular property is independent of the context (i.e., the local structural environment in which that transformation occurs). Experiments with large sets of hERG, solubility, and lipophilicity data demonstrate that the inclusion of contextual information can enhance the predictive power of MMP analyses, with significant trends (both positive and negative) being identified that are not apparent when using conventional, context-independent approaches.
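The difference between context-independent and context-dependent analysis can be sketched as a grouping choice. The example below is illustrative only: the record layout `(transformation, context, delta)`, the function name `mean_delta`, and the toy values in the usage note are hypothetical and not taken from the study's data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Assumption: each matched pair is summarised as
# (transformation label, local-context label, change in property value).
Pair = Tuple[str, str, float]


def mean_delta(pairs: Iterable[Pair], use_context: bool) -> Dict[tuple, float]:
    """Average property change per transformation, optionally split by context."""
    groups = defaultdict(list)
    for transformation, context, delta in pairs:
        key = (transformation, context) if use_context else (transformation,)
        groups[key].append(delta)
    return {key: mean(values) for key, values in groups.items()}
```

For example, with the made-up pairs `[("H>>F", "aromatic", 0.4), ("H>>F", "aromatic", 0.6), ("H>>F", "aliphatic", -0.5), ("H>>F", "aliphatic", -0.5)]`, the context-independent mean for `H>>F` is about zero, while splitting by context exposes one positive and one negative trend. This is the kind of effect that pooling across contexts can hide.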
A substructural analysis approach is used to calculate biological activity profiles, which contain weights that describe the differential occurrences of generic features (specifically, the numbers of hydrogen-bond donors and acceptors, the numbers of rotatable bonds and aromatic rings, the molecular weights, and the 2 kappa alpha descriptors) in active molecules taken from the World Drug Index and in (presumed) inactive molecules taken from the SPRESI database. Even with such simple structural descriptors, the profiles discriminate effectively between active and inactive compounds. The effectiveness of the approach is further increased by using a genetic algorithm for the calculation of the weights comprising a profile. The methods have been successfully applied to a number of different data sets.
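One way to picture such a profile is as a table of per-feature weights that are summed to score a molecule. The abstract does not give the weighting formula, so the sketch below substitutes a smoothed log-ratio of occurrence frequencies in actives versus inactives; that formula, the `(feature name, value)` encoding, and the names `profile_weights` and `score` are all assumptions for illustration, and the genetic-algorithm refinement of the weights is not shown.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Set

# Assumption: a molecule is a set of (feature name, value) items,
# e.g. {("hbd", 2), ("rings", 1)}.
Molecule = Set[Hashable]


def profile_weights(actives: List[Molecule],
                    inactives: List[Molecule]) -> Dict[Hashable, float]:
    """Weight each feature by how much more often it occurs in actives.

    Uses a smoothed log-ratio of occurrence frequencies (an assumed formula,
    not necessarily the one used in the paper).
    """
    active_counts = Counter(f for mol in actives for f in set(mol))
    inactive_counts = Counter(f for mol in inactives for f in set(mol))
    weights = {}
    for feature in set(active_counts) | set(inactive_counts):
        # Add-one smoothing keeps the ratio finite for unseen features.
        p_active = (active_counts[feature] + 1) / (len(actives) + 2)
        p_inactive = (inactive_counts[feature] + 1) / (len(inactives) + 2)
        weights[feature] = math.log(p_active / p_inactive)
    return weights


def score(molecule: Iterable[Hashable], weights: Dict[Hashable, float]) -> float:
    """Sum the weights of the features present; unknown features contribute 0."""
    return sum(weights.get(feature, 0.0) for feature in set(molecule))
```

Positive weights mark feature values seen more often in actives, negative weights those seen more often in inactives, so ranking a database by `score` places molecules with active-like profiles first.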