Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net. Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.
By combining bioinformatics with quantum chemical calculations, we attempt to address quantitatively some of the physical principles underlying protein folding. The former allowed us to identify tripeptide sequences in existing protein three-dimensional structures with a strong preference for either helical or extended structure. The selected representatives of pro-helical and pro-extended sequences were converted into "isolated" tripeptides, capped at the N- and C-termini, and these were subjected to extensive conformational sampling and geometry optimization (typically thousands to tens of thousands of conformers for each tripeptide). For each conformer, the QM(DFT-D3)/COSMO-RS free-energy value, G_conf(solv), was then calculated. The ΔG_conf(solv) is expected to provide an objective, unbiased, and quantitatively accurate measure of the conformational preference of the particular tripeptide sequence. It has been shown that, irrespective of the helical vs extended preferences of the selected tripeptide sequences in the context of the protein, most of the low-energy conformers of isolated tripeptides prefer the α-helical structure. Nevertheless, pro-helical tripeptides show a slightly stronger helix preference than their pro-extended counterparts. Furthermore, when the sampling is repeated in the presence of a partner tripeptide to mimic the situation in a β-sheet, pro-extended tripeptides (exemplified by VIV) show a larger free-energy benefit than pro-helical tripeptides (exemplified by EAM). This effect is even more pronounced in a hydrophobic solvent, which mimics the less polar parts of a protein. This is in line with our bioinformatic results showing that the majority of pro-extended tripeptides are hydrophobic. The preference of the studied tripeptides for a specific secondary structure is thus governed by their plasticity to adapt to the environment.
In addition, we show that most of the "naturally occurring" conformations of tripeptide sequences, i.e., those found in existing three-dimensional protein structures, are within ∼10 kcal·mol⁻¹ of their global minima. In summary, our "ab initio" data suggest that complex protein structures may start to emerge already at the level of their small oligopeptidic units, which is in line with the hierarchical nature of protein folding.
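As a rough illustration of how relative conformer free energies relate to conformational preference, the following minimal sketch converts a list of free energies into Boltzmann populations. The energy values, the temperature, and the function name are hypothetical and are not taken from the study; the sketch only shows the standard Boltzmann weighting, not the authors' workflow.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_populations(free_energies_kcal, temperature=298.15):
    """Return the Boltzmann population of each conformer.

    free_energies_kcal: conformer free energies in kcal/mol, all given
    relative to the same reference.
    """
    g_min = min(free_energies_kcal)
    # Shift by the lowest value so every exponent is <= 0 (numerically stable).
    weights = [
        math.exp(-(g - g_min) / (R_KCAL * temperature))
        for g in free_energies_kcal
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical relative free energies (kcal/mol) of four conformers.
energies = [0.0, 0.5, 2.0, 10.0]
populations = boltzmann_populations(energies)
```

The populations sum to one and fall off exponentially with the free-energy difference from the lowest conformer.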
Large biomolecules (proteins and nucleic acids) are composed of building blocks which define their identity, properties and binding capabilities. In order to shed light on the energetic side of interactions of amino acids between themselves and with deoxyribonucleotides, we present the Amino Acid Interaction web server (http://bioinfo.uochb.cas.cz/INTAA/). INTAA offers the calculation of the residue Interaction Energy Matrix for any protein structure (deposited in the Protein Data Bank or submitted by the user) and a comprehensive analysis of the interfaces in protein–DNA complexes. The Interaction Energy Matrix web application aims to identify key residues within protein structures which contribute significantly to the stability of the protein. The application provides an interactive user interface enhanced by a 3D structure viewer for efficient visualization of pairwise and net interaction energies of individual amino acids, side chains and backbones. The protein–DNA interaction analysis part of the web server allows the user to view the relative abundance of various configurations of amino acid–deoxyribonucleotide pairs found at the protein–DNA interface and the interaction energies corresponding to these configurations calculated using a molecular mechanical force field. The effects of the sugar-phosphate moiety and of the dielectric properties of the solvent on the interaction energies can be studied for the various configurations.
Background: Structure search is one of the valuable capabilities of small-molecule databases. Fingerprint-based screening methods are usually employed to enhance the search performance by reducing the number of calls to the verification procedure. In substructure search, fingerprints are designed to capture important structural aspects of the molecule to aid the decision about whether the molecule contains a given substructure. Currently available cartridges typically provide acceptable search performance for processing user queries, but do not scale satisfactorily with dataset size. Results: We present Sachem, a new open-source chemical cartridge that implements two substructure search methods: the first is a performance-oriented reimplementation of substructure indexing based on the OrChem fingerprint, and the second is a novel method that employs newly designed fingerprints stored in inverted indices. We assessed the performance of both methods on small, medium, and large datasets containing 1, 10, and 94 million compounds, respectively. Comparison of Sachem with other freely available cartridges revealed improvements in overall performance, scaling potential and screen-out efficiency. Conclusions: The Sachem cartridge allows efficient substructure searches in databases of all sizes. The sublinear performance scaling of the second method and the ability to efficiently query large amounts of pre-extracted information may together open the door to new applications for substructure searches.
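The general idea of screening with fingerprints stored in an inverted index can be sketched as follows. This is a toy illustration under stated assumptions, not Sachem's implementation: the fragment labels, molecule names, and function names are hypothetical, and the exact verification step (subgraph matching on the surviving candidates) is not shown.

```python
from collections import defaultdict

def build_inverted_index(fingerprints):
    """Map each feature to the set of molecule IDs whose fingerprint contains it."""
    index = defaultdict(set)
    for mol_id, features in fingerprints.items():
        for feature in features:
            index[feature].add(mol_id)
    return index

def screen(index, query_features, all_ids):
    """Return the IDs that contain every query feature (candidates for verification)."""
    candidates = set(all_ids)
    # Intersect the shortest posting lists first to shrink the candidate set quickly.
    for feature in sorted(query_features, key=lambda f: len(index.get(f, ()))):
        candidates &= index.get(feature, set())
        if not candidates:
            break
    return candidates

# Hypothetical fingerprints: each molecule is reduced to a set of fragment labels.
fingerprints = {
    "ethanol": {"C-C", "C-O"},
    "acetic_acid": {"C-C", "C-O", "C=O"},
    "ethane": {"C-C"},
    "benzene": {"c:c"},
}
index = build_inverted_index(fingerprints)
candidates = screen(index, {"C-O", "C=O"}, fingerprints.keys())
```

Only the molecules that pass the screen would be handed to the expensive verification procedure, which is how screening reduces the number of verification calls.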
As contemporary bioinformatic and chemoinformatic capabilities are reshaping natural products research, major benefits could result from an open database of referenced structure-organism pairs. Those pairs allow the identification of distinct molecular structures found as components of heterogeneous chemical matrices originating from living organisms. Current databases with such information suffer from paywall restrictions, limited taxonomic scope, poorly standardized fields, and lack of interoperability. To ensure data quality, references to the work that describes the structure-organism relationship are mandatory. To fill this void, we collected and curated a set of structure-organism pairs from publicly available natural products databases to yield LOTUS (naturaL prOducTs occUrrences databaSe), which contains over 500,000 curated and referenced structure-organism pairs. All the programs developed for data collection, curation, and dissemination are publicly available. To provide unlimited access as well as standardized linking to other resources, LOTUS data is both hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net. The diffusion of these referenced structure-organism pairs within the Wikidata framework addresses many of the limitations of currently available databases and facilitates linkage to existing biological and chemical data resources. This resource represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.
Motivation: The existing connections between large databases of chemicals, proteins, metabolites and assays offer valuable resources for research in fields ranging from drug design to metabolomics. Transparent search across multiple databases provides a way to efficiently utilize these resources. To simplify such searches, many databases have adopted semantic technologies that allow interoperable querying of the datasets using the SPARQL query language. However, the interoperable interfaces of the chemical databases still lack the functionality of structure-driven chemical search, which is a fundamental method of data discovery in the chemical search space. Results: We present a SPARQL service that augments existing semantic services by making interoperable substructure and similarity searches in small-molecule databases possible. The service thus offers new possibilities for querying interoperable databases, and simplifies writing of heterogeneous queries that include chemical-structure search terms. Availability: The service is freely available and accessible using a standard SPARQL endpoint interface. The service documentation and user-oriented demonstration interfaces that allow quick explorative querying of datasets are available at https://idsm.elixir-czech.cz.
The Resource Description Framework (RDF), together with well-defined ontologies, significantly increases data interoperability and usability. The SPARQL query language was introduced to retrieve requested RDF data and to explore links between them. Among other useful features, SPARQL supports federated queries that combine multiple independent data source endpoints. This allows users to obtain insights that are not possible using only a single data source. Owing to all of these useful features, many biological and chemical databases present their data in RDF and support SPARQL querying. In our project, we primarily focused on the PubChem, ChEMBL and ChEBI small-molecule datasets. These datasets are already being exported to RDF by their creators. However, none of them has an official and currently supported SPARQL endpoint. This omission makes it difficult to construct complex or federated queries that could access all of the datasets, thus underutilising the main advantage of the availability of RDF data. Our goal is to address this gap by integrating the datasets into one database called the Integrated Database of Small Molecules (IDSM) that will be accessible through a SPARQL endpoint. Beyond that, we will also focus on increasing mutual interoperability of the datasets. To realise the endpoint, we decided to implement an in-house developed SPARQL engine that uses the PostgreSQL relational database for data storage. In our approach, data are stored in the traditional relational form, and the SPARQL engine translates incoming SPARQL queries into equivalent SQL queries. An important feature of the engine is that it optimises the resulting SQL queries. Together with optimisations performed by PostgreSQL, this allows efficient evaluation of SPARQL queries. The endpoint provides not only querying of the datasets, but also the compound substructure and similarity search supported by our Sachem project.
Although the endpoint is accessible from an internet browser, it is mainly intended to be used for programmatic access by other services, for example as a part of federated queries. For regular users, we offer a rich web application called ChemWebRDF using the endpoint. The application is publicly available at https://idsm.elixir-czech.cz/chemweb/.
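The idea of translating a SPARQL query into an equivalent SQL query over relationally stored data can be sketched with a toy example. This is only an illustration under stated assumptions and does not reflect the actual IDSM engine: the predicate names, the table layout, and the `translate` function are hypothetical, and SQLite is used here in place of PostgreSQL so that the sketch is self-contained.

```python
import sqlite3

# Hypothetical mapping from RDF predicates to columns of one relational table.
PREDICATE_TO_COLUMN = {"ex:name": "name", "ex:formula": "formula"}

def translate(patterns, table="compound"):
    """Translate triple patterns that share one subject variable into SQL.

    Each pattern is (subject_var, predicate, obj); obj is either a variable
    ("?x") or a literal string. Returns (sql, parameters).
    """
    select, where, params = [], [], []
    for _subject, predicate, obj in patterns:
        column = PREDICATE_TO_COLUMN[predicate]
        if obj.startswith("?"):
            select.append(f"{column} AS {obj[1:]}")  # variable -> projected column
        else:
            where.append(f"{column} = ?")            # literal -> filter condition
            params.append(obj)
    sql = f"SELECT {', '.join(select) or 'id'} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, formula TEXT)")
conn.executemany(
    "INSERT INTO compound (name, formula) VALUES (?, ?)",
    [("ethanol", "C2H6O"), ("dimethyl ether", "C2H6O"), ("water", "H2O")],
)

# Corresponds to: SELECT ?name WHERE { ?c ex:formula "C2H6O" . ?c ex:name ?name }
sql, params = translate([("?c", "ex:formula", "C2H6O"), ("?c", "ex:name", "?name")])
rows = sorted(row[0] for row in conn.execute(sql, params))
```

Because both triple patterns map to columns of the same row, the translation yields a single-table query without joins; a real engine would also have to handle joins across tables, IRIs, and optimisation of the generated SQL.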