The Hadoop Distributed File System (HDFS) is designed to store and transfer data at large scale. To ensure availability and reliability, it uses data replication as a fault tolerance mechanism. However, this strategy can significantly affect the balance of replicas across the cluster. This paper analyzes the default data replication policy used by HDFS and measures its impact on system behavior, while presenting different strategies for cluster balancing and rebalancing. To highlight the requirements for efficient replica placement, a comparative study of HDFS performance was conducted considering a variety of factors that may result in cluster imbalance.
Data replication is the main fault tolerance mechanism of HDFS, the Hadoop Distributed File System. Although replication is essential to ensure high availability and reliability, replicas are not always placed evenly among the nodes. The HDFS Balancer is a solution integrated into Apache Hadoop that balances replicas by rearranging the data blocks stored in the file system. The Balancer, however, imposes a high computational load on the nodes during its operation. This work presents a customization of the HDFS Balancer that takes the status of the nodes into account as a strategy to minimize the overhead caused by the balancing operation in the cluster. To this end, metrics obtained at runtime are used to prioritize nodes during data redistribution, so that it occurs primarily between nodes with low communication traffic. In addition, the Balancer operates toward a minimum balance level, reducing the number of data transfers required to even out the data stored in the cluster. The evaluation results show that the proposed customization reduces the time and bandwidth needed to reach system balance.
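As a rough illustration of the two ideas described in that abstract (prioritizing nodes by a runtime traffic metric and stopping at a minimum balance level), the Java sketch below shows one way such a policy could be expressed. It is not the paper's implementation or the actual HDFS Balancer API; the names BalancerPrioritySketch, NodeStatus, bytesInFlight, and MIN_BALANCE_THRESHOLD are assumptions made purely for the example.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not the paper's implementation: prioritize low-traffic
// nodes and stop balancing once a minimum balance level is reached.
public class BalancerPrioritySketch {

    // Simplified, illustrative view of a datanode's runtime status.
    static class NodeStatus {
        final String host;
        final double usedPercent;   // storage utilization of the node (0-100)
        final long bytesInFlight;   // communication traffic observed at runtime

        NodeStatus(String host, double usedPercent, long bytesInFlight) {
            this.host = host;
            this.usedPercent = usedPercent;
            this.bytesInFlight = bytesInFlight;
        }
    }

    // Minimum balance level: consider the cluster balanced once every node is
    // within this many percentage points of the average utilization.
    static final double MIN_BALANCE_THRESHOLD = 10.0;

    static boolean isBalanced(List<NodeStatus> nodes) {
        double avg = nodes.stream().mapToDouble(n -> n.usedPercent).average().orElse(0.0);
        return nodes.stream()
                .allMatch(n -> Math.abs(n.usedPercent - avg) <= MIN_BALANCE_THRESHOLD);
    }

    // Candidate nodes are visited in order of current traffic, so block moves
    // happen primarily between nodes with low communication load.
    static List<NodeStatus> byTrafficPriority(List<NodeStatus> nodes) {
        return nodes.stream()
                .sorted(Comparator.comparingLong((NodeStatus n) -> n.bytesInFlight))
                .collect(Collectors.toList());
    }
}

In this sketch, balancing would simply be skipped whenever isBalanced returns true, and otherwise block moves would be attempted between the lowest-traffic nodes first.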
Data replication is one of the main fault tolerance mechanisms used by HDFS. However, the way replicas are placed among the compute nodes directly affects the balance and performance of the system. The HDFS Balancer is a solution provided by Apache Hadoop that aims to even out the distribution of data. Its current operating policy, however, cannot address availability and reliability demands when redistributing replicas among the racks of the cluster. This work presents a customized balancing strategy for the HDFS Balancer based on trust factors, which are computed for each rack from the failure rate of its nodes. After detailing the implementation, an experimental investigation was conducted to validate and demonstrate the effectiveness of the developed strategy.
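To make the notion of a per-rack trust factor concrete, the hedged Java sketch below computes one possible value from the average failure rate of each rack's nodes. The names RackTrustSketch and NodeInfo, and the 1 / (1 + average failure rate) formula, are assumptions made for illustration only, not the definition used by the paper.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not the paper's implementation: derive a per-rack trust
// factor from the failure rates of the rack's nodes.
public class RackTrustSketch {

    static class NodeInfo {
        final String rack;
        final double failureRate;   // observed failures per time unit (illustrative)

        NodeInfo(String rack, double failureRate) {
            this.rack = rack;
            this.failureRate = failureRate;
        }
    }

    // One possible definition: trust decreases as the average failure rate of a
    // rack's nodes grows; racks with higher trust would then be preferred when
    // the balancer redistributes replicas.
    static Map<String, Double> trustFactors(List<NodeInfo> nodes) {
        Map<String, Double> avgFailureByRack = nodes.stream()
                .collect(Collectors.groupingBy(n -> n.rack,
                        Collectors.averagingDouble(n -> n.failureRate)));
        return avgFailureByRack.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> 1.0 / (1.0 + e.getValue())));
    }
}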
MapReduce is a parallel programming paradigm successfully used to perform computations on massive amounts of data, and is widely deployed on cluster, grid, and cloud infrastructures. Interestingly, while the emergence of cloud infrastructures has opened new perspectives, several enterprises hesitate to put sensitive data on the cloud and prefer to rely on internal resources. In this paper we introduce the PER-MARE initiative, which aims at proposing scalable techniques to support existing MapReduce data-intensive applications in the context of loosely coupled networks such as pervasive and desktop grids. By relying on the MapReduce programming model, PER-MARE proposes to explore the potential advantages of using free, unused resources available at enterprises as pervasive grids, alone or in a hybrid environment. This paper presents the main lines that orient the PER-MARE approach and some preliminary results.
This study presents advances in designing and implementing scalable techniques to support the development and execution of MapReduce applications on pervasive distributed computing infrastructures, in the context of the PER-MARE project. A pervasive framework for MapReduce applications is very useful in practice, especially for scientific, enterprise, and educational centers that have many unused or underused computing resources, which can be fully exploited to solve relevant problems that demand large computing power, such as scientific computing applications, big data processing, etc. In this study, we propose multiple techniques to support volatility and heterogeneity in MapReduce by applying two complementary approaches: improving the Apache Hadoop middleware by including context-awareness and fault-tolerance features, and providing an alternative pervasive grid implementation fully adapted to dynamic environments. The main design and implementation decisions for both alternatives are described and validated through experiments, demonstrating that our approaches provide high reliability when executing on pervasive environments. The analysis of the experiments also leads to several insights on the requirements and constraints of dynamic and volatile systems, reinforcing the importance of context-aware information and advanced fault-tolerance features to provide efficient and reliable MapReduce services on pervasive grids.