Biomedical questions are often complex and address multiple topics simultaneously. Answering them requires the comprehensive evaluation of several different types of data. They are often available, but in distributed and heterogeneous data sources; this hampers their global evaluation. We developed a software architecture to create and maintain updated a Genomic and Proteomic Data Warehouse (GPDW), which integrates several of the main of such dispersed data. It uses a modular and multi-level global data schema based on abstraction and generalization of integrated data features. Such a schema eases integration of data sources evolving in data content, structure and number, and assures provenance tracking of all the integrated data. Thanks to the developed software architecture and adopted data schema, the GPDW has been kept updated easily and progressively extended with additional data types and sources; it is publicly usable at http://www.bioinformatics.dei.polimi.it/GPKB/
SummaryNumerous biomolecular data are available, but they are scattered in many databases and only some of them are curated by experts. Most available data are computationally derived and include errors and inconsistencies. Effective use of available data in order to derive new knowledge hence requires data integration and quality improvement. Many approaches for data integration have been proposed. Data warehousing seams to be the most adequate when comprehensive analysis of integrated data is required. This makes it the most suitable also to implement comprehensive quality controls on integrated data. We previously developed GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a web system that supports scientists in effectively using available information. It allows comprehensive statistical analysis and mining of functional and phenotypic annotations of gene lists, such as those identified by high-throughput biomolecular experiments. GFINDer backend is composed of a multi-organism genomic and proteomic data warehouse (GPDW). Within the GPDW, several controlled terminologies and ontologies, which describe gene and gene product related biomolecular processes, functions and phenotypes, are imported and integrated, together with their associations with genes and proteins of several organisms. In order to ease maintaining updated the GPDW and to ensure the best possible quality of data integrated in subsequent updating of the data warehouse, we developed several automatic procedures. Within them, we implemented numerous data quality control techniques to test the integrated data for a variety of possible errors and inconsistencies. Among other features, the implemented controls check data structure and completeness, ontological data consistency, ID format and evolution, unexpected data quantification values, and consistency of data from single and multiple sources. We use the implemented controls to analyze the quality of data available from several different biological databases and integrated in the GFINDer data warehouse. By doing so, we identified in these data a variety of different types of errors and inconsistencies; this enables us to ensure good quality of the data in the GFINDer data warehouse. We reported all identified data errors and inconsistencies to the curators of the original databases from where the data were retrieved, who mainly corrected them in subsequent updating of the original database. This contributed to improve the quality of the data available, in the original databases, to the whole scientific community.
Abstract. We demonstrate Liquid Queries, a novel user interaction paradigm for exploratory multi-domain search upon structured information collected from heterogeneous data sources. Liquid Queries support an exploratory search approach by providing a set of interaction primitives for multi-domain query formulation, result visualization and query refinement, with commands for perusing the result set, changing the visualization of data based on their type (e.g., geographical) and interacting with the remote search services.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.