Schema matching identifies elements of two given schemas that correspond to each other. Although there are many algorithms for schema matching, little has been written about building a system that can be used in practice. We describe our initial experience building such a system, a customizable schema matcher called Protoplasm.

Schema Matching

Most systems-integration work requires creating mappings between models, such as database schemas, message formats, interface definitions, and user-interface forms. As a design activity, schema mapping is similar to database design in that it requires digging deeply into the semantics of schemas. This is usually quite time consuming, not only in level of effort but also in elapsed time, because the process of teasing out semantics is slow. It can benefit from the development of improved tools, but it is unlikely such tools will provide a silver bullet that automates all of the work. For example, it is hard to imagine how the best tool could eliminate the need for a designer to read documentation, ask developers and end users how they use the data, or review test runs to check for unexpected results. In a sense, the problem is AI-complete, that is, as hard as reproducing human intelligence.

The best commercially available model-mapping tools we know of are basically graphical programming tools. That is, they allow one to specify a schema mapping as a directed graph whose nodes are simple data transformations and whose edges are data flows. Such tools help specify a mapping between two messages, a data warehouse loading script, or a database query. While such graphical programming is an improvement over typing code, no database design intelligence is being offered. Despite the limited expectations expressed in the previous paragraph, we should certainly be able to offer some intelligent automated help, in addition to attractive graphics.

There are two steps to automating the creation of mappings between schemas: schema matching and query discovery. Schema matching identifies elements that correspond to each other but does not explain how they correspond. For example, it might say that FirstName and LastName in one schema are related to Name in the other, but not say that concatenating the former yields the latter. Query discovery picks up where schema matching leaves off: given the correspondences, it obtains queries to translate instances of the source schema into instances of the target, e.g., using query analysis and data mining [5].

This paper focuses on schema matching. There are many algorithms to solve it [8]. They exploit name similarity, thesauri, schema structure, instances, value distributions of instances, past mappings, constraints, cluster analysis of a schema corpus, and similarity to standard schemas. All of these algorithms have merit. So what we need is a toolset that incorporates them in an integrated package. This is the subject of this paper.

The published work on schema matching is mostly about algorithms, not systems. This algorithm work is helpful, off...
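To make the matching-versus-query-discovery distinction above concrete, here is a minimal Python sketch. The schemas, the FirstName/LastName-to-Name example, and the mapping function are illustrative assumptions based on the example in the text; they are not Protoplasm's API or output format.

```python
# Toy contrast between schema matching (which elements correspond) and
# query discovery (how to transform source instances into target instances).
# Schemas, scores, and the mapping function are hypothetical illustrations.

source_schema = ["FirstName", "LastName", "HireDate"]
target_schema = ["Name", "StartDate"]

# A matcher's output: correspondences with similarity scores,
# but no transformation logic.
correspondences = [
    ({"FirstName", "LastName"}, "Name", 0.85),
    ({"HireDate"}, "StartDate", 0.92),
]

# Query discovery's output: an executable mapping built on top of the
# correspondences, e.g. concatenating FirstName and LastName into Name.
def map_row(row: dict) -> dict:
    return {
        "Name": f"{row['FirstName']} {row['LastName']}",
        "StartDate": row["HireDate"],
    }

print(map_row({"FirstName": "Ada", "LastName": "Lovelace", "HireDate": "1843-01-01"}))
```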
Architecture and quality in data warehouses - An extended repository approach
Jarke, M.; Jeusfeld, M. A.; Quix, C.; Vassiliadis, P.
Published in: Information Systems. Publication date: 1999.
Citation for published version (APA): Jarke, M., Jeusfeld, M. A., Quix, C., & Vassiliadis, P. (1999). Architecture and quality in data warehouses - An extended repository approach. Information Systems, 24(3), 229-253.
Affiliations: (1) Informatik V, RWTH Aachen, 52056 Aachen, Germany; (2) Infolab, KUB University, Postbus 90153, 5000 LE Tilburg, The Netherlands; (3) Computer Science Division, NTUA Athens, Zographou 15773, Athens, Greece.

Abstract: Most database researchers have studied data warehouses (DW) in their role as buffers of materialized views, mediating between update-intensive OLTP systems and query-intensive decision support. This neglects the organizational role of data warehousing as a means of centralized information flow control. As a consequence, a large number of quality aspects relevant for data warehousing cannot be expressed with current DW meta models. This paper makes two contributions towards solving these problems. Firstly, we enrich the meta data about DW architectures with explicit enterprise models. Secondly, because many very different mathematical techniques for measuring or optimizing certain aspects of DW quality are being developed, we adapt the Goal-Question-Metric approach from software quality management to a meta data management environment in order to link these special techniques to a generic conceptual framework of DW quality. The approach has been implemented in full on top of the ConceptBase repository system and has undergone some validation by applying it to the support of specific quality-oriented methods, tools, and application projects in data warehousing.
Abstract. Data warehouses are complex systems that have to deliver highly aggregated, high-quality data from heterogeneous sources to decision makers. Due to dynamic changes in the requirements and the environment, data warehouse systems rely on meta databases to control their operation and to aid their evolution. In this paper, we present an approach to assessing the quality of a data warehouse via a semantically rich model of quality management. The model allows stakeholders to design abstract quality goals that are translated into executable analysis queries on quality measurements in the data warehouse's meta database. The approach is being implemented using the ConceptBase meta database system.
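The core mechanism described above is translating an abstract quality goal into an executable analysis query over stored quality measurements. The following is a minimal sketch of that idea under stated assumptions: the class, field names, thresholds, and SQL-like query string are invented for illustration; the actual approach expresses goals and measurements in the ConceptBase meta database rather than in SQL.

```python
# Hedged sketch: an abstract quality goal turned into an analysis query
# over quality measurements. All names and the query syntax are illustrative.

from dataclasses import dataclass


@dataclass
class QualityGoal:
    stakeholder: str          # e.g. "decision maker"
    object_of_interest: str   # e.g. a materialized view in the warehouse
    dimension: str            # e.g. "freshness", "completeness"
    threshold: float          # minimum acceptable measurement value


def to_analysis_query(goal: QualityGoal) -> str:
    # Translate the abstract goal into a query that finds measurements
    # violating the goal (sketched here as a SQL-like string).
    return (
        "SELECT object, value FROM quality_measurement "
        f"WHERE object = '{goal.object_of_interest}' "
        f"AND dimension = '{goal.dimension}' "
        f"AND value < {goal.threshold}"
    )


goal = QualityGoal("decision maker", "SalesCube", "freshness", 0.9)
print(to_analysis_query(goal))  # lists measurements that violate the goal
```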
As the challenge of our time, Big Data still poses many research challenges, especially with regard to the variety of data. The high diversity of data sources often results in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data lake systems have been proposed as a solution to this problem by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management would only lead to a 'data swamp'. To avoid this, we propose Constance, a data lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.
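One way to picture the unified query interface mentioned above is metadata-driven rewriting: a query phrased against a common vocabulary is rewritten into the field names of each underlying source. The sources, mappings, and function below are invented for illustration only; Constance's actual rewriting engines for structured and semi-structured data are far more general.

```python
# Illustrative sketch of metadata-driven query rewriting over heterogeneous
# sources. Source names and attribute mappings are hypothetical.

metadata_mappings = {
    "patients.json": {"patient_id": "id", "birth_date": "dob"},
    "patients.csv": {"patient_id": "PatientID", "birth_date": "DateOfBirth"},
}


def rewrite(query_fields: list[str], source: str) -> list[str]:
    # Map fields of the unified vocabulary onto the source-specific names.
    mapping = metadata_mappings[source]
    return [mapping[f] for f in query_fields]


# One logical query, rewritten once per source:
for src in metadata_mappings:
    print(src, rewrite(["patient_id", "birth_date"], src))
```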
In addition to volume and velocity, Big Data is also characterized by its variety. Variety in structure and semantics requires new integration approaches that can resolve the integration challenges even for large volumes of data. Data lakes should reduce the upfront integration costs and provide a more flexible way of integrating and analyzing data, as source data is loaded into the data lake repository in its original structure. Some syntactic transformation might be applied to enable access to the data in one common repository; however, deep semantic integration is done only after the initial loading of the data into the data lake. Thereby, data is easily made available and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component in a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes structural and semantic metadata. The use case applied for evaluation is from the life science domain, where data is often stored only in files, which hinders data access and efficient querying. The GEMMS framework has proven to be useful in this domain. In particular, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori.
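As a rough illustration of a metamodel that separates structural from semantic metadata, consider the sketch below. The class names, fields, and the ontology annotation are assumptions made for this example and do not reproduce GEMMS's actual metamodel.

```python
# Hedged sketch: an extensible metamodel distinguishing structural metadata
# (format, fields, types) from semantic metadata (domain annotations).
# All class and field names are illustrative.

from dataclasses import dataclass, field


@dataclass
class StructuralMetadata:
    file_format: str                                        # e.g. "CSV", "XML"
    fields: dict[str, str] = field(default_factory=dict)    # field name -> data type


@dataclass
class SemanticMetadata:
    annotations: dict[str, str] = field(default_factory=dict)  # field name -> concept


@dataclass
class DataUnit:
    source_path: str
    structure: StructuralMetadata
    semantics: SemanticMetadata


# Hypothetical file from a life-science experiment, annotated with a
# placeholder ontology concept:
unit = DataUnit(
    "experiments/run_042.csv",
    StructuralMetadata("CSV", {"temp": "float", "time": "int"}),
    SemanticMetadata({"temp": "example-ontology:Temperature"}),
)
print(unit.structure.fields, unit.semantics.annotations)
```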