Peer-to-peer (P2P) networks are gaining popularity in many applications such as file sharing, e-commerce, and social networking, many of which deal with rich, distributed data sources that can benefit from data mining. P2P networks are, in fact, well-suited to distributed data mining (DDM), which deals with the problem of data analysis in environments with distributed data, computing nodes, and users. This article offers an overview of DDM applications and algorithms for P2P environments, focusing particularly on local algorithms that perform data analysis by using computing primitives with limited communication overhead. The authors describe both exact and approximate local P2P data mining algorithms that work in a decentralized and communication-efficient manner.
LANs, peer-to-peer (P2P) networks, mobile ad hoc wireless networks (Manets), and other pervasive distributed computing environments often include distributed data and computation sources. Data mining in such networks naturally calls for proper utilization of these distributed resources in an efficient, decentralized manner. Data mining algorithms that require substantial communication among the nodes, synchronous computing nodes, and complete centralized control have difficulty scaling in such distributed environments. Moreover, privacy concerns and resource issues in multiparty applications often dictate that data sets collected at different sites be analyzed in a distributed fashion without collecting everything at central sites. Most off-the-shelf data mining products are designed to work as monolithic centralized applications, downloading relevant data to centralized locations to perform data mining operations, but this centralized approach doesn't work well in many emerging distributed data mining applications. Distributed data mining (DDM) offers an alternate approach to address this problem of mining data using distributed resources.
DDM pays careful attention to distributed data, computing, communication, and human resources to use them in a near-optimal fashion. Distributed P2P systems are emerging as a solution of choice for a new breed of applications such as file sharing, collaborative movie and song scoring, electronic commerce, and surveillance using sensor networks. DDM is gaining increasing attention in this domain for advanced data-driven applications. This article presents an overview of efforts to use DDM technology in P2P networks. Our goal is to present a high-level introduction to this field with pointers for
This paper offers a scalable and robust distributed algorithm for decision tree induction in large Peer-to-Peer (P2P) environments. Computing a decision tree in such large distributed systems using standard centralized algorithms can be very communication-expensive and impractical because of the synchronization requirements. The problem becomes even more challenging in the distributed stream monitoring scenario where the decision tree needs to be updated in response to changes in the data distribution. This paper presents an alternate solution that works in a completely asynchronous manner in distributed environments and offers low communication overhead, a necessity for scalability. It also seamlessly handles changes in data and peer failures. The paper presents extensive experimental results to corroborate the theoretical claims.
Abstract: In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high communication cost. The cost increases further in a dynamic scenario where the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
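The monitor-then-recompute idea in the abstract above can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: the model here is just a global mean, the peers are simulated as in-memory lists, and the names `local_violation`, `monitor_and_update`, and `epsilon` are hypothetical. The sketch relies on one fact: the global mean is a weighted average of the peers' local means, so if every local mean lies within `epsilon` of the current model, the global mean does too, and no communication is needed.

```python
from statistics import fmean


def local_violation(local_data, model_mean, epsilon):
    """Local check: has this peer's own average drifted more than epsilon
    from the current model? Uses only data held by the peer."""
    return abs(fmean(local_data) - model_mean) > epsilon


def monitor_and_update(peers, model_mean, epsilon):
    """Simulated feedback loop over all peers (assumption: a real system
    would run the check on each peer and exchange messages only on a
    violation).

    peers: list of non-empty lists of floats, one list per peer.
    Returns (model, recomputed).
    """
    # Step 1: cheap monitoring. If no peer reports a violation, the global
    # mean is guaranteed to be within epsilon of the model, so keep it.
    if not any(local_violation(data, model_mean, epsilon) for data in peers):
        return model_mean, False

    # Step 2: expensive recomputation, triggered only by a violation.
    total = sum(sum(data) for data in peers)
    count = sum(len(data) for data in peers)
    return total / count, True


if __name__ == "__main__":
    stable = [[1.0, 1.2], [0.9, 1.1]]
    print(monitor_and_update(stable, model_mean=1.0, epsilon=0.2))

    drifted = [[2.0, 2.0], [1.0, 1.0]]
    print(monitor_and_update(drifted, model_mean=1.0, epsilon=0.2))
```

The design choice worth noting is the asymmetry of cost: the check in step 1 touches only local data, while step 2 stands in for whatever global computation (for example, re-running k-means) the system would perform when the model is found to be stale.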