Abstract. Data warehouses are traditionally refreshed in a periodic manner, most often on a daily basis. Thus, there is some delay between a business transaction and its appearance in the data warehouse. The most recent data is trapped in the operational sources, where it is unavailable for analysis. For timely decision making, today's business users ask for ever fresher data. Near real-time data warehousing addresses this challenge by shortening the data warehouse refreshment intervals and hence delivering source data to the data warehouse with lower latency. One consequence is that data warehouse refreshment can no longer be performed in off-peak hours only. In particular, the source data may be changed concurrently with data warehouse refreshment. In this paper we show that anomalies may arise under these circumstances, leading to an inconsistent state of the data warehouse, and we propose approaches to avoid refreshment anomalies.
Keywords: Near real-time data warehousing, Change Data Capture (CDC), Extract-Transform-Load (ETL), incremental loading of data warehouses.
Near Real-Time Data Warehousing

Data warehousing is a prominent approach to materialized data integration. Data of interest, scattered across multiple heterogeneous sources, is integrated into a central database system referred to as the data warehouse. Data integration proceeds in three steps: data of interest is first extracted from the sources, subsequently transformed and cleansed, and finally loaded into the data warehouse. Dedicated systems referred to as Extract-Transform-Load (ETL) tools have been built to support these data integration steps. The data warehouse facilitates complex data analyses without placing a burden on the operational source systems that run the day-to-day business. In order to catch up with data changes in the operational sources, the data warehouse is refreshed in a periodic manner, usually on a daily basis.
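The three ETL steps applied during a periodic refreshment cycle can be sketched as follows. This is a minimal illustration, not an actual ETL tool: all names (extract, transform, load, the row layout, the last_refresh watermark) are hypothetical, and the in-memory dictionary stands in for the warehouse's relational store.

```python
# Illustrative sketch of the Extract-Transform-Load steps for periodic
# data warehouse refreshment. All identifiers are hypothetical.

def extract(source_rows, last_refresh):
    """Extract: pull only rows changed since the previous refresh."""
    return [r for r in source_rows if r["updated"] > last_refresh]

def transform(rows):
    """Transform/cleanse: normalize values and drop incomplete records."""
    return [
        {"id": r["id"], "amount": round(r["amount"], 2)}
        for r in rows
        if r.get("amount") is not None
    ]

def load(warehouse, rows):
    """Load: upsert the transformed rows into the warehouse table."""
    for r in rows:
        warehouse[r["id"]] = r

# One periodic (e.g. nightly) refreshment cycle:
source = [
    {"id": 1, "amount": 10.5,  "updated": 5},
    {"id": 2, "amount": None,  "updated": 7},  # incomplete, cleansed away
    {"id": 3, "amount": 3.337, "updated": 2},  # unchanged since last refresh
]
warehouse = {}
load(warehouse, transform(extract(source, last_refresh=4)))
print(warehouse)  # {1: {'id': 1, 'amount': 10.5}}
```

Note that the extract step relies on a change-detection criterion (here a simple timestamp watermark); Change Data Capture techniques generalize this idea, and the anomalies discussed in this paper arise precisely when source rows change while such a cycle is in progress.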