Abstract. Knowledge Discovery in Databases (KDD) is currently a hot topic in industry and academia. Although KDD is now widely accepted as a complex process of many different phases, the focus of research behind most emerging products is on underlying algorithms and modelling techniques. The main bottleneck for KDD applications is not the lack of techniques. The challenge is to exploit and combine existing algorithms effectively, and to help the user during all phases of the KDD process. In this paper, we describe the project Citrus, which addresses these practically relevant issues. Starting from a commercially available system, we develop a scalable, extensible tool inherently based on the view of KDD as an interactive and iterative process. We sketch the main components of this system, namely an information manager for effective retrieval of data and results, an execution server for efficient execution, and a process support interface for guiding the user through the process.
Data quality problems have been a persistent concern, especially for large, historically grown databases. If such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is a priori unknown, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at DaimlerChrysler shows the usefulness of the approach as a complement to standard data scrubbing.
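The generate-pollute-audit loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' test generator: the two-attribute schema, the `model` to `engine` rule, the 5% pollution rate, and all function names are assumptions made for illustration, and a simple majority-value lookup stands in for the C4.5-induced structure. Because the polluted records are known, the auditor's precision and recall can be measured directly, which is the calibration idea the abstract refers to.

```python
import random
from collections import Counter, defaultdict

ENGINES = ["petrol", "diesel", "electric"]          # hypothetical domain
RULE = {"A": "petrol", "B": "diesel", "C": "electric"}  # hypothetical dependency


def generate_clean(n, rng):
    """Artificial benchmark: 'engine' is fully determined by 'model'."""
    models = [rng.choice("ABC") for _ in range(n)]
    return [{"model": m, "engine": RULE[m]} for m in models]


def pollute(records, rate, rng):
    """Return a polluted copy plus the indices of the corrupted records."""
    polluted, corrupted = [], set()
    for i, rec in enumerate(records):
        rec = dict(rec)  # copy, so the clean benchmark stays untouched
        if rng.random() < rate:
            # Replace the value with a different one from the same domain.
            wrong = [e for e in ENGINES if e != rec["engine"]]
            rec["engine"] = rng.choice(wrong)
            corrupted.add(i)
        polluted.append(rec)
    return polluted, corrupted


def audit(records):
    """Induce the majority 'engine' per 'model' and flag deviating records.

    Stand-in for a learned model such as a decision tree.
    """
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["model"]][rec["engine"]] += 1
    majority = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    return {i for i, rec in enumerate(records)
            if rec["engine"] != majority[rec["model"]]}


def precision_recall(flagged, corrupted):
    """Score the auditor against the known pollution."""
    hits = len(flagged & corrupted)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(corrupted) if corrupted else 1.0
    return precision, recall


rng = random.Random(0)
clean = generate_clean(1000, rng)
polluted, corrupted = pollute(clean, 0.05, rng)
flagged = audit(polluted)
precision, recall = precision_recall(flagged, corrupted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting the dependency is deterministic and the pollution rate is low, so the auditor recovers the polluted records easily. Noisier rules or higher pollution rates would make the trade-off between precision and recall visible.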
REVI-MINER is a KDD environment that supports the detection and analysis of deviations in warranty and goodwill cost statements. The system was developed within the framework of a cooperation between DaimlerChrysler Research & Technology and Global Service and Parts (GSP) and is based upon the CRISP-DM methodology, a widely accepted process model for solving data mining problems. We have also implemented different approaches based on machine learning and statistics that can be used for data cleaning in the preprocessing phase. The data mining models applied were developed using a statistical deviation detection approach. The tool supports controllers in their task of auditing the authorized repair shops. In this paper, we describe the development phases that led to REVI-MINER.