Task scheduling is critical to reducing the makespan of MapReduce jobs. Improving data locality, that is, placing a task and its related data block on the same node, is an effective approach to scheduling optimization. However, recent studies have not adequately addressed the locality issue. This paper proposes BOLAS, a MapReduce task scheduling algorithm that models the scheduling process as a bipartite-graph matching problem and tries to assign each data block to its nearest task. By considering the divergence in node performance and the distribution of data blocks in a MapReduce cluster, BOLAS can achieve a high degree of data locality, guarantee minimal data transfer during execution, and consequently reduce a job's makespan. As a dynamic algorithm, BOLAS solves the model using the Kuhn-Munkres optimal matching algorithm and can be deployed in either homogeneous or heterogeneous environments. In this study, BOLAS was implemented as a plugin for Hadoop, and the experimental results indicate that BOLAS can localize nearly 100% of the map tasks and reduce the execution time by up to 67.1%.
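To make the matching formulation concrete, the following is a minimal sketch of locality-aware assignment as a minimum-cost bipartite matching, solved with a textbook O(n^3) Kuhn-Munkres (Hungarian) routine. It is not the BOLAS implementation: the function name `min_cost_assignment`, the three-task, three-node example, and the cost values (0 for a node-local block, 2 for a remote one) are all assumptions chosen for illustration.

```python
from typing import List

def min_cost_assignment(cost: List[List[float]]) -> List[int]:
    """Kuhn-Munkres (Hungarian) algorithm with row/column potentials.

    cost[i][j] is the cost of running task i on node j (hypothetical model).
    Requires len(cost) <= len(cost[0]). Returns, for each task, the index of
    the node it is matched to, minimizing the total cost.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)      # potentials for tasks (1-indexed)
    v = [0.0] * (m + 1)      # potentials for nodes (1-indexed)
    p = [0] * (m + 1)        # p[j] = task currently matched to node j (0 = none)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            # Find the unused node with the smallest reduced cost.
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Update potentials so that a new tight edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free node: augmenting path found
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

# Hypothetical example: rows are map tasks, columns are nodes A, B, C.
# Cost 0 means the task's input block has a replica on that node; 2 means remote.
locality_cost = [
    [0, 0, 2],   # task 0: block replicated on A and B
    [0, 2, 2],   # task 1: block only on A
    [2, 0, 0],   # task 2: block replicated on B and C
]
assignment = min_cost_assignment(locality_cost)
total_cost = sum(locality_cost[t][n] for t, n in enumerate(assignment))
```

In this toy instance the only fully local assignment is task 0 on B, task 1 on A, and task 2 on C. A scheduler that greedily places task 0 on A would force task 1 to read its block remotely, which is the kind of outcome an optimal matching avoids.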
As a distributed MapReduce framework, Hadoop has been widely adopted in big data processing, with HDFS (Hadoop Distributed File System) most commonly used for data storage. Though the single-master architecture of HDFS simplifies design and implementation, it suffers from issues such as SPOF (Single Point Of Failure) and limited scalability, which may in turn become performance bottlenecks. To address these problems, this paper proposes NM2H, a NoSQL-based metadata management approach for HDFS. NM2H separates the storage and query of metadata, in contrast to the traditional architecture, which mixes them, and keeps the interfaces among the metadata service, DataNodes and clients unchanged through a novel mapping mechanism between the original metadata structures and NoSQL documents. Therefore, the new approach can not only take advantage of NoSQL's better scalability and fault tolerance, but also deliver transparency to client applications, so that existing programs can run on the new architecture without any modification. The prototype of NM2H was designed and implemented with MongoDB, a widely adopted NoSQL system. Extensive performance evaluation was conducted, and the experimental results indicated the improvement delivered by NM2H, while the overhead introduced was acceptable.
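The abstract does not describe the mapping mechanism itself, so the following is only an illustrative sketch of what a mapping between a file-metadata record and a NoSQL document might look like. The `FileMeta` class, its fields, and the document layout are hypothetical and are not taken from NM2H. The one fixed convention used is that MongoDB documents carry a primary key in the `_id` field.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class FileMeta:
    """Hypothetical, simplified stand-in for a file's namespace metadata."""
    path: str
    replication: int
    block_size: int
    block_ids: List[int] = field(default_factory=list)

def to_document(meta: FileMeta) -> Dict[str, Any]:
    """Map a metadata record to a JSON-like document (assumed layout)."""
    return {
        "_id": meta.path,                       # full path as the document key
        "parent": posixpath.dirname(meta.path), # lets a directory listing be a query on "parent"
        "replication": meta.replication,
        "blockSize": meta.block_size,
        "blocks": list(meta.block_ids),
    }

def from_document(doc: Dict[str, Any]) -> FileMeta:
    """Inverse mapping, so callers can keep working with the original structure."""
    return FileMeta(
        path=doc["_id"],
        replication=doc["replication"],
        block_size=doc["blockSize"],
        block_ids=list(doc["blocks"]),
    )

meta = FileMeta("/logs/app/part-0000", replication=3,
                block_size=128 * 1024 * 1024, block_ids=[1001, 1002])
doc = to_document(meta)
restored = from_document(doc)
```

Because the mapping round-trips, code above the storage layer can continue to see the original structure, which is one plausible way the unchanged interfaces described in the abstract could be preserved.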