Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Like the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing the vectors of the individual substructures and, for instance, be fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pretrained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment-independent and thus can also be easily used for proteins with low sequence similarities.
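The encoding step described above (summing substructure vectors to obtain a compound vector) can be illustrated with a minimal sketch. This is not the Mol2vec implementation: the identifiers `id_a`, `id_b`, `id_c`, the three-dimensional vectors, and the zero-vector fallback for unknown substructures are all hypothetical choices made for illustration. A trained model would supply real, higher-dimensional embeddings.

```python
import math

# Hypothetical 3-dimensional substructure embeddings with made-up values.
# In practice these would come from a model trained on a compound corpus.
SUBSTRUCTURE_VECTORS = {
    "id_a": [1.0, 0.0, 0.5],
    "id_b": [0.8, 0.1, 0.4],
    "id_c": [-0.5, 1.0, 0.0],
}

# Assumption of this sketch: substructures missing from the table
# contribute a zero vector.
FALLBACK = [0.0, 0.0, 0.0]


def embed_compound(substructure_ids):
    """Encode a compound as the sum of its substructure vectors."""
    total = [0.0] * len(FALLBACK)
    for sid in substructure_ids:
        vec = SUBSTRUCTURE_VECTORS.get(sid, FALLBACK)
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# A compound made of substructures id_a and id_c.
compound_vector = embed_compound(["id_a", "id_c"])

# Vectors pointing in similar directions score close to 1.
similar = cosine_similarity(SUBSTRUCTURE_VECTORS["id_a"], SUBSTRUCTURE_VECTORS["id_b"])
dissimilar = cosine_similarity(SUBSTRUCTURE_VECTORS["id_a"], SUBSTRUCTURE_VECTORS["id_c"])
```

The resulting `compound_vector` is a dense, fixed-length feature vector that could be passed to any supervised learner. Cosine similarity is used here only as one common way to quantify whether two vectors "point in similar directions".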
Protein kinases are involved in a variety of diseases including cancer, inflammation, and autoimmune disorders. Although the development of new kinase inhibitors is a major focus in pharmaceutical research, a large number of kinases have so far remained unexplored in drug discovery projects. The selection and assessment of targets is an essential but challenging area. Today, a few thousand experimentally determined kinase structures are available, covering about half of the human kinome. This large structural resource makes it possible to guide target selection via structure-based druggability prediction approaches such as DoGSiteScorer. Here, a thorough analysis of the ATP pockets of the entire human kinome in the DFG-in state is presented in order to prioritize novel kinase structures for drug discovery projects. For this, all human kinase X-ray structures available in the PDB were collected, and homology models were generated for the missing part of the kinome. DoGSiteScorer was used to calculate geometrical and physicochemical properties of the ATP pockets and to predict the potential of each kinase to be druggable. The results indicate that about 75% of the kinome is in principle druggable. Top-ranking structures comprise kinases that are primary targets of known approved drugs but additionally point to so far less explored kinases. The presented analysis provides new insights into the druggability of ATP binding pockets of the entire kinome. We anticipate that this comprehensive druggability assessment of protein kinases will help the community to prioritize so far untapped kinases for drug discovery efforts.
and higher temperature may provide a growth advantage for toxin-producing species (Kleinteich et al., 2012). MCs, which are produced by several cyanobacteria species, e.g., Microcystis spp., Dolichospermum spp. or Planktothrix spp., in water bodies worldwide (Preece et al., 2017), represent one of the toxin types most frequently associated with drinking water, food supplement and/or food contamination and have resulted in human morbidity and mortality. Structurally, MCs are cyclic heptapeptides consisting of common L-amino acids as well as uncommon and unique amino acids. Their general structure is cyclo(). X and Z stand for variable L-amino acids, while β-D-MeAsp is erythro-β-D-methylaspartate, ADDA is (2S,3S,8S,9S,4E,6E)-3-amino-9-methoxy-2,6,8-trimethyl-10-phenyl-4,6-decadienoic acid and Mdha is N-methyldehydroalanine. The variable positions, along with various (de)methylation sites (Fig. 1, Tab. S1), currently account for 248 known MC congeners (Spoof and Catherine, 2017), although new MC congeners are continuously being discovered. However, contrary to a recent
Current tracking technology such as GPS data loggers allows biologists to remotely collect large amounts of movement data for a large variety of species. Extending, and often replacing, interpretation based on observation, the analysis of the collected data supports research on animal behaviour, on impact factors such as climate change and human intervention on the globe, as well as on conservation programs. However, this analysis is difficult due to the nature of the research questions and the complexity of the data sets. It requires both automated analysis, for example, for the detection of behavioural patterns, and human inspection, for example, for interpretation, inclusion of previous knowledge, and for conclusions on future actions and decision making. For this analysis and inspection, the movement data needs to be put into the context of environmental data, which helps to interpret the behaviour. Thus, a major challenge is to design and develop methods and intuitive interfaces that integrate the data for analysis by biologists. We present a concept and implementation for the web-based visual analysis of cheetah movement data that can be used both in the field and in office environments.