Abstract: The conventional implementations of clustering algorithms cannot cope with managing and analyzing the rapid growth of data generated from different sources. Parallel clustering is one of the robust solutions to this problem. The Apache Hadoop architecture is one of the ecosystems that provide the capability to store and process data in a distributed and parallel fashion. In this paper, a parallel model is designed to run the k-means clustering algorithm on the Apache Hadoop ecosystem by connecting three nodes: one master (name) node and two worker (data) nodes. The aim is to reduce the time needed to process a massive healthcare insurance dataset of 11 GB, using the machine learning algorithms provided by the Mahout framework. The experimental results show that the proposed model can process large datasets efficiently. The parallel k-means algorithm outperforms the sequential k-means algorithm in execution time: processing the 11 GB dataset takes around 1.847 hours with the parallel k-means algorithm versus 68.567 hours with the sequential algorithm, a speedup of roughly 37x. We therefore conclude that as the number of nodes in the parallel system increases, the computation time of the proposed algorithm decreases.

Introduction: Today, a large amount of data is being produced in many areas, including e-commerce, social networks, finance, healthcare, and education. This growth in data volume in turn increases the need for an efficient computing framework to process the data and transform it into meaningful information. In recent years, many distributed computing frameworks [1][2][3][4][5][6] have been developed to perform large-scale data processing. MapReduce [2] (with its open-source implementation, Hadoop [7]) …

“…This was applied to a large dataset and an initial set of centroids; using many reducers gave high speed while keeping accuracy very close. In another article [12], the authors introduced a new system called iiHadoop (incremental iterative Hadoop) that restricts computation to the small segment of data affected by changes instead of the whole dataset. This system improves the performance of the Map and Reduce tasks.…”
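To make the parallel design concrete, below is a minimal illustrative sketch, not the authors' code and independent of Mahout, of one k-means iteration expressed as a Hadoop MapReduce job in Java: mappers assign each point to its nearest current centroid, and reducers average the assigned points to produce the updated centroids. The class name KMeansIteration and the configuration property kmeans.centroids are hypothetical, and the sketch assumes the centroids are small enough to broadcast through the job configuration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIteration {

  /** Parse a comma-separated line ("x,y,...") into a point. */
  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] p = new double[parts.length];
    for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i].trim());
    return p;
  }

  public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context ctx) {
      // Current centroids arrive via the job configuration as "x1,y1;x2,y2;..."
      // (a simplification; a file in the distributed cache is the usual choice).
      for (String c : ctx.getConfiguration().get("kmeans.centroids").split(";")) {
        centroids.add(parse(c));
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double[] p = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.size(); i++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) {
          double diff = p[j] - centroids.get(i)[j];
          d += diff * diff; // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = i; }
      }
      ctx.write(new IntWritable(best), value); // point goes to its nearest centroid
    }
  }

  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text t : points) {
        double[] p = parse(t.toString());
        if (sum == null) sum = new double[p.length];
        for (int j = 0; j < p.length; j++) sum[j] += p[j];
        n++;
      }
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < sum.length; j++) {
        if (j > 0) sb.append(',');
        sb.append(sum[j] / n); // new centroid = mean of assigned points
      }
      ctx.write(key, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("kmeans.centroids", args[2]); // e.g. "1.0,1.0;5.0,5.0;9.0,9.0"
    Job job = Job.getInstance(conf, "kmeans-iteration");
    job.setJarByClass(KMeansIteration.class);
    job.setMapperClass(AssignMapper.class);
    job.setReducerClass(RecomputeReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In practice, a driver re-runs this job once per iteration, feeding each round's reducer output back in as the next round's centroids until they stop moving; Mahout's MapReduce k-means follows the same map-assign/reduce-recompute pattern, which is why adding data nodes shortens the per-iteration time.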