Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important issues in molecular biology and pharmaceutical sciences, but they are still not fully understood. Structural and computational biology have contributed greatly to the elucidation of the mechanism of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminant analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and, in fact, outperform some of the previously published procedures, whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be utilized to predict IFRs even when homologous proteins are missing (particularly important for orphan proteins, where no homologue is available for comparative analysis/indication) or when certain conformational changes accompany interface formation. The development of amino acid type specific classifiers is shown to increase IFR classification performance. We also found that the addition of an amino acid conservation attribute did not improve the classification prediction. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces.
The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through the STING_interfaces module, accessible at the following website: http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html.
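The ROC comparison against a random classifier mentioned above can be illustrated with a minimal Python sketch. It computes the area under the ROC curve (AUC) by pairwise comparison of scores; a random classifier is expected to score about 0.5. The scores and labels below are invented for illustration and are not taken from the study, and the function name `roc_auc` is an assumption.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented classifier outputs: label 1 = interface-forming residue (IFR),
# label 0 = free surface residue (FSR).
scores = [0.91, 0.78, 0.66, 0.40, 0.35, 0.12]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of residues, so it suits small examples; a rank-based computation would be preferable for large datasets.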
Protein secondary structure elements (PSSEs) such as α-helices, β-strands, and turns are the primary building blocks of the tertiary protein structure. Our primary interest here is to reveal the characteristics of the nanoenvironment formed by both PSSEs and their surrounding amino acid residues (AARs), which might contribute to the general understanding of how proteins fold. The characteristics of such nanoenvironments must be specific to each secondary structure element, and we have set our goal here to gather the fullest possible description of the α-helical nanoenvironment. In general, this postulate (the existence of specific nanoenvironments for specific protein substructures/neighbourhoods/regions with distinct functionality) was already successfully explored and confirmed for some protein regions, such as protein-protein interfaces and enzyme catalytic sites. Consequently, PSSEs were the obvious next choice for additional work for further evidence showing that specific nanoenvironments (having characteristics fully describable by means of structural and physical chemical descriptors) do exist for the corresponding and determined intraprotein regions. The nanoenvironment of α-helices (nEoαH) is defined as any region of the protein where this secondary structure element type is detected. The nEoαH, therefore, includes not only the α-helix amino acid residues but also the residues immediately around the α-helix. The hypothesis that motivated this work is that it might in fact be possible to detect a postulated “signal” or “signature” that distinguishes the specific location of α-helices. This “signal” must be discernible by tracking differences in the values of physical, chemical, physicochemical, structural and geometric descriptors immediately before (or after) the PSSE from those in the region along the α-helices. The search for this specific nanoenvironment “signal” was made possible by aligning previously selected α-helices of equal length. 
Afterward, we calculated the average value, standard deviation and mean square error at each aligned residue position for each selected descriptor. We applied Student’s t-test, the Kolmogorov-Smirnov test and MANOVA to the dataset constructed as described above, and the results confirmed that the hypothesized “signal”/“signature” both exists and is identifiable, and that it is capable of distinguishing the presence of an α-helix inside the specific nanoenvironment, contextualized as a specific region within the whole protein. However, such a conclusion can rarely be reached if only one descriptor is considered at a time. A more accurate signal with broader coverage is achieved only if one applies multivariate analysis, which means that several descriptors (usually approximately 10 descriptors) should be considered at the same time. To a limited extent (up to a maximum of 15% of cases), such a conclusion is also possible with only a single descriptor, and the conclusion is also possible in general for up to 50–80% of cases when no fewer than 5 nonlinear descriptors a...
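The per-position statistics described for the aligned α-helices can be sketched as follows. This is a minimal Python illustration, assuming each aligned segment is a list of values of one descriptor, one value per residue position; the descriptor values and the function name `per_position_stats` are invented, and only the mean and sample standard deviation are computed here.

```python
from statistics import mean, stdev


def per_position_stats(aligned):
    """Return (mean, sample standard deviation) for each aligned position.

    `aligned` is a list of equal-length lists: one list per segment,
    one descriptor value per residue position.
    """
    lengths = {len(row) for row in aligned}
    if len(lengths) != 1:
        raise ValueError("all aligned segments must have equal length")
    # zip(*aligned) turns rows (segments) into columns (positions).
    return [(mean(col), stdev(col)) for col in zip(*aligned)]


# Invented values of a single descriptor for three aligned segments.
profiles = [
    [0.2, 0.9, 1.0, 0.3],
    [0.4, 1.1, 0.8, 0.1],
    [0.3, 1.0, 0.9, 0.2],
]
stats = per_position_stats(profiles)
for position, (avg, sd) in enumerate(stats):
    print(position, round(avg, 3), round(sd, 3))
```

Comparing such per-position profiles between flanking and in-helix positions is the kind of input a univariate test would use; the multivariate analysis would consider several descriptors jointly.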
The term "agrochemicals" is used in its generic form to represent a spectrum of pesticides, such as insecticides, fungicides or bactericides. They contain active components designed for optimized pest management and control, therefore allowing for economically sound and labor-efficient agricultural production. A "drug", on the other hand, is a term used for compounds designed for controlling human diseases. Although drugs are subjected to much more severe testing and regulation procedures before reaching the market, they might contain exactly the same active ingredient as certain agrochemicals. This is the case described in the present work, which shows how a small chemical compound might be used to control the pathogenicity of the Gram-negative bacterium Xylella fastidiosa, which devastates citrus plantations, as well as to control, for example, meningitis in humans. It is also clear that, so far, the production of new agrochemicals is not benefiting as much from in silico identification and discovery of new chemical compounds as pharmaceutical production is. Rational drug design crucially depends on detailed structural information about the receptor (target protein) and the ligand (drug/agrochemical). The interaction between the two molecules is the subject of analysis that aims to understand the relationship between structure and function, mainly by deciphering some fundamental elements of the nanoenvironment where the interaction occurs. In this work we emphasize the role of understanding the nanoenvironmental factors that guide the recognition and interaction of a target protein and its function modifier, an agrochemical or a drug. The repertoire of nanoenvironment descriptors is used for two selected and specific cases that we have approached in order to offer a technological solution for some very important problems that need special attention in agriculture: elimination of the pathogenicity of a bacterium that attacks citrus plants, and formulation of a new fungicide.
Finally, we also briefly describe a workflow that might be useful when research requires that model structures of target proteins are first generated (starting from genome sequences). This is followed by identification of ligand-target sites at the surface of those modeled structures, then by procedures that adequately prepare both protein and ligand structures (the latter also involving filtration that satisfies acceptable absorption/distribution/metabolism/excretion/toxicity [ADMET] parameters) for virtual high-throughput screening (involving docking of ligands to the indicated sites), and finally by ranking of the best pairs: target protein with selected ligand.
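The last two steps of the workflow above (ADMET filtering and ranking of target-ligand pairs) can be sketched in Python. This is only an illustration under stated assumptions: the `DockingResult` type, the `admet_ok` flag, the convention that a lower docking score is better, and all names and values are hypothetical and do not come from the described work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingResult:
    target: str     # hypothetical target protein identifier
    ligand: str     # hypothetical ligand identifier
    score: float    # hypothetical docking score; lower is assumed better
    admet_ok: bool  # assumed outcome of an upstream ADMET filter


def rank_pairs(results, top_n=3):
    """Drop ligands that failed the ADMET filter, then rank by score."""
    passed = [r for r in results if r.admet_ok]
    return sorted(passed, key=lambda r: r.score)[:top_n]


# Invented screening output.
results = [
    DockingResult("targetA", "lig1", -7.2, True),
    DockingResult("targetA", "lig2", -9.1, False),  # best score, fails ADMET
    DockingResult("targetA", "lig3", -8.4, True),
    DockingResult("targetB", "lig1", -6.0, True),
]
best = rank_pairs(results, top_n=2)
for r in best:
    print(r.target, r.ligand, r.score)
```

Filtering before ranking reflects the order given in the text, where ligand preparation and ADMET filtration precede docking and ranking.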
Background: Animal pollination is an important ecosystem function and service, ensuring both the integrity of natural systems and human well-being. Although many knowledge shortfalls remain, some high-quality data sets on biological interactions are now available. The development and adoption of standards for biodiversity data and metadata has promoted great advances in biological data sharing and aggregation, supporting large-scale studies and science-based public policies. However, these standards are currently not suitable to fully support interaction data sharing.
Results: Here we present a vocabulary of terms and a data model for sharing plant–pollinator interactions data based on the Darwin Core standard. The vocabulary introduces 48 new terms targeting several aspects of plant–pollinator interactions and can be used to capture information from different approaches and scales. Additionally, we provide solutions for data serialization using RDF, XML, and DwC-Archives and recommendations of existing controlled vocabularies for some of the terms. Our contribution supports open access to standardized data on plant–pollinator interactions.
Conclusions: The adoption of the vocabulary would facilitate data sharing to support studies ranging from the spatial and temporal distribution of interactions to the taxonomic, phenological, functional, and phylogenetic aspects of plant–pollinator interactions. We expect to fill data and knowledge gaps, thus further enabling scientific research on the ecology and evolution of plant–pollinator communities, biodiversity conservation, ecosystem services, and the development of public policies. The proposed data model is flexible and can be adapted for sharing other types of interactions data by developing discipline-specific vocabularies of terms.
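To make the idea of an interaction record concrete, the following Python sketch serializes one plant–pollinator interaction as two tables. It uses only terms that already exist in Darwin Core (Occurrence terms and the ResourceRelationship class), not the 48 new terms of the vocabulary, which are not listed in this abstract. The identifiers, the relationship wording and the records themselves are invented for illustration.

```python
import csv
import io

# Two invented occurrence records: a plant and a flower visitor.
occurrences = [
    {"occurrenceID": "occ-plant-1", "scientificName": "Passiflora edulis",
     "eventDate": "2021-03-14"},
    {"occurrenceID": "occ-bee-1", "scientificName": "Xylocopa frontalis",
     "eventDate": "2021-03-14"},
]

# One relationship row linking the two occurrences, using
# Darwin Core ResourceRelationship terms as column names.
relationships = [
    {"resourceRelationshipID": "rel-1",
     "resourceID": "occ-bee-1",
     "relatedResourceID": "occ-plant-1",
     "relationshipOfResource": "visits flowers of"},
]


def to_csv(rows):
    """Serialize a list of dicts with identical keys as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(occurrences))
print(to_csv(relationships))
```

A dedicated vocabulary such as the one presented here would add interaction-specific columns to tables of this kind instead of relying on free-text relationship labels.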
Secondary structure elements are found in almost all protein structures revealed so far. In general, there are more β-sheets than α-helices found inside the protein structures. For example, considering the PDB, DSSP and Stride definitions for secondary structure elements and by using the consensus among those, we found 60,727 helices in 4,376 chains identified in all-α structures and 129,440 helices in 7,898 chains identified in all-α and α + β structures. For β-sheets, we identified 837,345 strands in 184,925 β-sheets located within 50,803 chains of all-β structures and 1,541,961 strands in 355,431 β-sheets located within 86,939 chains in all-β and α + β structures (data extracted on February 1, 2019). In this paper we first address a full characterization of the nanoenvironment found at β-sheet locations and then compare those characteristics with the ones we already published for α-helical secondary structure elements. For this characterization we use, as in our previous work on α-helical nanoenvironments, a set of STING protein structure descriptors. As in the previous work, we assume that we will be able to prove that there is a set of protein structure parameters/attributes/descriptors that could fully describe the nanoenvironment around β-sheets, and that appropriate statistical analysis will point to significant changes in the values of those parameters when compared for loci inside and outside the defined secondary structure element. Clearly, while the univariate analysis is straightforward and intuitively understood, it is severely limited in coverage: it could be successfully applied in at most 25% of the studied cases. The indication of the main descriptors for the specific secondary structure element (SSE) by means of the multivariate MANOVA test is a strong statistical tool for complete discrimination among the SSEs, and it proved to be the one with the highest coverage.
The complete description of the nanoenvironment might be understood, by analogy, in terms of a key-lock system, where all of the lock's mini cylinders need to combine their elevations (controlled by a matching key) to open the lock. The main idea is as follows: a set of descriptors (the cylinders in the key-lock example) must precisely combine their values (the elevations) to form and maintain a specific secondary structure element nanoenvironment (the required condition for a key being able to open a lock).
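The key-lock analogy can be expressed as a small Python sketch: a nanoenvironment "opens" only if every descriptor falls inside its required range at the same time. The descriptor names and ranges below are hypothetical placeholders, not STING descriptors or values from the study, and a simple range check is only a stand-in for the multivariate analysis described above.

```python
def matches_nanoenvironment(values, allowed_ranges):
    """True only if every required descriptor is present and in range,
    like every cylinder of a lock sitting at the right elevation."""
    return all(
        name in values and lo <= values[name] <= hi
        for name, (lo, hi) in allowed_ranges.items()
    )


# Hypothetical descriptors and allowed ranges (illustrative only).
allowed = {
    "descriptor_a": (0.2, 0.6),
    "descriptor_b": (10.0, 25.0),
    "descriptor_c": (-1.0, 0.0),
}

residue_ok = {"descriptor_a": 0.4, "descriptor_b": 18.0, "descriptor_c": -0.3}
residue_bad = {"descriptor_a": 0.4, "descriptor_b": 30.0, "descriptor_c": -0.3}

print(matches_nanoenvironment(residue_ok, allowed))   # all three in range
print(matches_nanoenvironment(residue_bad, allowed))  # descriptor_b out of range
```

A single out-of-range descriptor is enough to fail the match, which mirrors why one descriptor considered alone gives limited coverage.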
For those biologists and biodiversity data managers who are unfamiliar with information science practices of data standardization, the use of complex software to assist in the creation of standardized datasets can be a barrier to sharing data. Since the ratification of the Darwin Core Standard (DwC) (Darwin Core Task Group 2009) by the Biodiversity Information Standards (TDWG) in 2009, many datasets have been published and shared through a variety of data portals. In the early stages of biodiversity data sharing, the protocol Distributed Generic Information Retrieval (DiGIR), progenitor of DwC, and later the protocols BioCASe and TDWG Access Protocol for Information Retrieval (TAPIR) (De Giovanni et al. 2010) were introduced for discovery, search and retrieval of distributed data, simplifying data exchange between information systems. Although these protocols are still in use, they are known to be inefficient for transferring large amounts of data (GBIF 2017). Because of that, in 2011 the Global Biodiversity Information Facility (GBIF) introduced the Darwin Core Archive (DwC-A), which allows more efficient data transfer and has become the preferred format for publishing data in the GBIF network. DwC-A is a structured collection of text files, which makes use of the DwC terms to produce a single, self-contained dataset. Many tools for assisting data sharing using DwC-A have been introduced, such as the Integrated Publishing Toolkit (IPT) (Robertson et al. 2014), the Darwin Core Archive Assistant (GBIF 2010) and the Darwin Core Archive Validator. Although these tools promote and facilitate data sharing, many users have difficulties using them, mainly because of the lack of training in information science in the biodiversity curriculum (Convention on Biological Diversity 2012, Enke et al. 2012).
Most users, however, are very familiar with using spreadsheets to store and organize their data, whereas adopting the available solutions requires data transformation and training in information science and, more specifically, biodiversity informatics. For an example of how spreadsheets can simplify data sharing see Stoev et al. (2016). In order to provide a more "familiar" approach to data sharing using DwC-A, we introduce a new tool as a Google Sheet Add-on. The Add-on, called Darwin Core Archive Assistant Add-on, can be installed in the user's Google Account from the G Suite MarketPlace and used in conjunction with the Google Sheets application. The Add-on assists the mapping of spreadsheet columns/fields to DwC terms (Fig. 1), similar to IPT, but with the advantage that it does not require the user to export the spreadsheet and import it into another application. Additionally, the Add-on facilitates the creation of a star schema in accordance with DwC-A, by the definition of a "CORE_ID" (e.g. occurrenceID, eventID, taxonID) field between sheets of a document (Fig. 2). The Add-on also provides an Ecological Metadata Language (EML) (Jones et al. 2019) editor (Fig. 3) with minimal fields to be filled in (i.e., mandatory fields required by IPT), and helps users to generate and share DwC-Archives stored in the user's Google Drive, which can be downloaded as a DwC-A or automatically uploaded to another public storage resource like a user's Zenodo Account (Fig. 4). We expect that the Google Sheet Add-on introduced here, in conjunction with IPT, will promote biodiversity data sharing in a standardized format, as it requires minimal training and simplifies the process of data sharing from the user's perspective, mainly for those users who are not familiar with IPT but have historically worked with spreadsheets.
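The star schema described above, in which a core table and an extension table are linked through a shared core identifier, can be sketched in Python. This is not the Add-on's code (which runs inside Google Sheets); it is a standalone illustration that writes a minimal Darwin Core Archive to memory with a `meta.xml` descriptor. The file names, records and helper names are assumptions, and a publishable archive would normally also carry an EML metadata file.

```python
import io
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"


def table_xml(tag, id_tag, location, row_type, terms):
    """Describe one tab-separated file; column 0 holds the linking identifier."""
    head = (
        f'  <{tag} encoding="UTF-8" fieldsTerminatedBy="\\t" '
        f'linesTerminatedBy="\\n" fieldsEnclosedBy="" '
        f'ignoreHeaderLines="1" rowType="{DWC}{row_type}">'
    )
    body = [f"    <files><location>{location}</location></files>",
            f'    <{id_tag} index="0"/>']
    body += [f'    <field index="{i}" term="{DWC}{t}"/>'
             for i, t in enumerate(terms, start=1)]
    return "\n".join([head, *body, f"  </{tag}>"])


def tsv(header, rows):
    """Render a header and rows as tab-separated text."""
    return "\n".join("\t".join(map(str, r)) for r in [header, *rows]) + "\n"


def build_archive(core_rows, ext_rows):
    """Return the bytes of a zip holding meta.xml, a core and an extension."""
    meta = "\n".join([
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">',
        table_xml("core", "id", "occurrence.txt", "Occurrence",
                  ["scientificName", "eventDate"]),
        table_xml("extension", "coreid", "measurement.txt", "MeasurementOrFact",
                  ["measurementType", "measurementValue"]),
        "</archive>",
    ]) + "\n"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", meta)
        zf.writestr("occurrence.txt",
                    tsv(["occurrenceID", "scientificName", "eventDate"], core_rows))
        zf.writestr("measurement.txt",
                    tsv(["occurrenceID", "measurementType", "measurementValue"],
                        ext_rows))
    return buf.getvalue()


# Invented records: one occurrence with two measurements pointing back to it.
archive = build_archive(
    core_rows=[["occ-1", "Apis mellifera", "2020-01-01"]],
    ext_rows=[["occ-1", "body length", "12"], ["occ-1", "wing length", "9"]],
)
print(len(archive), "bytes")
```

Each extension row repeats the core identifier in its first column, which is what lets one core record carry any number of extension rows.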
Although the DwC-A generated by the Add-on still needs to be published using IPT, it does provide a simpler interface (i.e., a spreadsheet) for mapping data sets to DwC than IPT. Even though the IPT includes many more features than the Darwin Core Archive Assistant Add-on, we expect that the Add-on can be a "starting point" for users unfamiliar with biodiversity informatics before they move on to more advanced data publishing tools. On the other hand, Zenodo integration allows users to share and cite their standardized data sets without publishing them via IPT, which can be useful for users without access to an IPT installation. Additionally, we are working on new features, and future releases will include the automatic generation of Global Unique Identifiers for shared records, the possibility of adding additional data standards and DwC extensions, and integration with the GBIF REST API and the IPT REST API.
Biodiversity is a data-intensive science and relies on data from a large number of disciplines in order to build up a coherent picture of the extent and trajectory of life on earth (Bowker 2000). The ability to integrate such data from different disciplines, geographic regions and scales is crucial for making better decisions towards sustainable development. As the Biodiversity Information Standards (TDWG) community tackles standards development and adoption beyond its initial emphases on taxonomy and species distributions, expanding its impact and engaging a wider audience becomes increasingly important. Biological interactions data (e.g., predator-prey, host-parasite, plant-pollinator) have been a topic of interest within TDWG for many years and a Biological Interaction Data Interest Group (IG) was established in 2016 to address that issue. The IG has been working on the complexity of representing interactions data and surveying how Darwin Core (DwC, Wieczorek 2012) is being used to represent them (Salim 2022). The importance of cross-disciplinary science and data inspired the recently funded WorldFAIR project—Global cooperation on FAIR data policy and practice—coordinated by the Committee on Data of the International Science Council (CODATA), with the Research Data Alliance (RDA) as a major partner. WorldFAIR will work with a set of case studies to advance implementation of the FAIR data principles (Fig. 1). The FAIR data principles promote good practices in data management, by making data and metadata Findable, Accessible, Interoperable, and Reusable (Wilkinson 2016). Interoperability will be a particular focus to facilitate cross-disciplinary research. A set of recommendations and a framework for FAIR assessment in a set of disciplines will be developed (Molloy 2022). One of WorldFAIR's case studies is related to plant-pollinator interactions data. 
Its starting point is the model and schema proposed by Salim (2022), which are based on the DwC standard (adhering to the diversifying GBIF data model strategy) and on the Plant-Pollinator vocabulary described by Salim (2021). The case study on plant-pollinator interactions originated in the TDWG Biological Interaction Data Interest Group (IG) and within the RDA Improving Global Agricultural Data (IGAD) Community of Practice. IGAD is a forum for sharing experiences and providing visibility to research and work in food and agricultural data and has become a space for networking and blending ideas related to data management and interoperability. This topic was chosen because interoperability of plant-pollinator data is needed for better monitoring of pollination services, understanding the impacts of cultivated plants on wild pollinators and quantifying the contribution of wild pollinators to cultivated crops, understanding the impact of domesticated bees on wild ecosystems, and understanding the behaviour of these organisms and how this influences their effectiveness as pollinators. In addition to the ecological importance of these data, pollination is economically important for food production. In Brazil, the economic value of the pollination service was estimated at US$ 12 billion in 2018 (Wolowski 2019). All eleven case studies within the WorldFAIR project are working on FAIR Implementation Profiles (FIPs), which capture comprehensive sets of FAIR principle implementation choices made by communities of practice and which can accelerate convergence and facilitate cross-collaboration between disciplines (Schultes 2020). The FIPs are published through the FIP Wizard, which allows the creation of FAIR Enabling Resources. FIP creation will be repeated by the end of the project to capture the results obtained from each case study in order to advance data interoperability.
In the first FIP, resources from the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI) were catalogued by the Plant-Pollinator Case Study team, and we expect to expand the existing FAIR Enabling Resources by the end of the project and contribute to plant-pollinator data interoperability and reuse. To tackle the challenge of promoting FAIR data for plant-pollinator interactions within the broad scope of the several disciplines and subdisciplines that generate and use them, we will conduct a survey of existing initiatives handling plant-pollinator interactions data and summarise the current status of best practices in the community. Once the survey is concluded, we will choose at least five agriculture-specific plant-pollination initiatives from our partners, to serve as targets for standards adoption. For data to be interoperable and reusable, it is essential that standards and best practices are community-developed to ensure adoption by the tool builders and data scientists across the globe. TDWG plays an important role in this scenario and we expect to engage the IG and other interested parties in that discussion.