DBDC: Density Based Distributed Clustering
Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle
University of Munich, Institute for Computer Science
http://www.dbs.informatik.uni-muenchen.de
{januzaj, kriegel, pfeifle}@informatik.uni-muenchen.de
Abstract. Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, molecular biology as well as many others. In most of these areas, the data are originally collected at different sites. In order to extract information from these data, they are merged at a central site and then clustered. In this paper, we propose a different approach. We cluster the data locally and extract suitable representatives from these clusters. These representatives are sent to a global server site where we restore the complete clustering based on the local representatives. This approach is very efficient, because the local clustering can be carried out quickly and independently from each other. Furthermore, we have low transmission cost, as the number of transmitted representatives is much smaller than the cardinality of the complete data set. Based on this small number of representatives, the global clustering can be done very efficiently. For both the local and the global clustering, we use a density based clustering algorithm. The combination of both the local and the global clustering forms our new DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss the complex problem of finding a suitable quality measure for evaluating distributed clusterings. We introduce two quality criteria which are compared to each other and which allow us to evaluate the quality of our DBDC algorithm. In our experimental evaluation, we will show that we do not have to sacrifice clustering quality in order to gain an efficiency advantage when using our distributed clustering approach.
Introduction
Knowledge Discovery in Databases (KDD) tries to identify valid, novel, potentially useful, and ultimately understandable patterns in data. Traditional KDD applications require full access to the data which is going to be analyzed. All data has to be located at that site where it is scrutinized. Nowadays, large amounts of heterogeneous, complex data reside on different, independently working computers which are connected to each other via local or w ...
Abstract. Clustering has become an increasingly important task in analysing huge amounts of data. Traditional applications require that all data has to be located at the site where it is scrutinized. Nowadays, large amounts of heterogeneous, complex data reside on different, independently working computers which are connected to each other via local or wide area networks. In this paper, we propose a scalable density-based distributed clustering algorithm which allows a user-defined trade-off between clustering quality and the number of transmitted objects from the different local sites to a global server site. Our approach consists of the following steps: First, we order all objects located at a local site according to a quality criterion reflecting their suitability to serve as local representatives. Then we send the best of these representatives to a server site where they are clustered with a slightly enhanced density-based clustering algorithm. This approach is very efficient, because the local determination of suitable representatives can be carried out quickly and independently from each other. Furthermore, based on the scalable number of the most suitable local representatives, the global clustering can be done very effectively and efficiently. In our experimental evaluation, we will show that our new scalable density-based distributed clustering approach results in high quality clusterings with scalable transmission cost.
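The steps named in this abstract (rank local objects, transmit the best ones, cluster them at the server) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the quality criterion is simply the number of neighbors within `eps`, the server runs plain DBSCAN rather than the "slightly enhanced" variant the abstract mentions, and the names `top_representatives`, `dbscan`, `eps`, `min_pts`, and `k` are hypothetical.

```python
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def top_representatives(points, eps, k):
    """Rank local objects by neighbor count and keep the best k.

    The neighbor count is an assumed stand-in for the paper's quality criterion.
    """
    ranked = sorted(range(len(points)),
                    key=lambda i: len(neighbors(points, i, eps)),
                    reverse=True)
    return [points[i] for i in ranked[:k]]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbors(points, j, eps)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels

# Two hypothetical local sites, each holding its own objects.
site_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
site_b = [(0.2, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]

# Each site transmits only its k best-ranked objects to the server.
k = 2
transmitted = (top_representatives(site_a, eps=0.15, k=k)
               + top_representatives(site_b, eps=0.15, k=k))

# The server clusters the transmitted representatives.
global_labels = dbscan(transmitted, eps=0.5, min_pts=2)
```

Raising `k` sends more objects and lets the server see more of each site's structure; lowering it cuts transmission cost. In this toy data, the two objects of `site_b` near the origin are never transmitted when `k = 2`, which is the kind of quality loss the trade-off controls.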
Modern information systems consist of many distributed computer and database systems. Integrating such distributed data into a single data warehouse system runs into the well-known problem of low data quality. In this paper, we present an approach that facilitates the dynamic identification of spurious and error-prone data stored in a large data warehouse. The identification of data quality problems is based on data mining techniques such as clustering, subspace clustering, and classification. Furthermore, we demonstrate the applicability of our approach to real data in a case study. The experimental results show that our approach efficiently identifies data quality problems.