Recent studies [17, 12] show that by leveraging the benefits of high-performance interconnects such as InfiniBand, MapReduce job execution time can be greatly reduced through additional features such as in-memory merge, pipelined merge and reduce, and prefetching and caching of map outputs. In this paper, we argue that it is time for a new performance model for the RDMA-based design of MapReduce over high-performance interconnects. Our initial results derived from the proposed analytical model match the experimental results within a 3-5% range.
Motivation

The authors of [17, 12] present enhanced designs and algorithms for the RDMA-based MapReduce framework. With these design changes, MapReduce job execution can be greatly accelerated by leveraging the benefits of high-performance interconnects. The high-performance design of Hadoop (Hadoop-RDMA) [3] also shows that significant performance benefits are achievable over RDMA-capable interconnects through enhanced designs of various components inside Hadoop (HDFS [6], MapReduce [12], RPC [9]). On the other hand, much performance-modeling research [4, 8, 2, 1, 13, 5, 7, 10, 11] has been carried out to analyze the default MapReduce framework in depth. However, because of the inherent architectural changes, these models are not appropriate for predicting the performance of RDMA-based enhanced MapReduce. For example, Table 1 compares the performance of the Sort benchmark on default Hadoop [16] and on enhanced MapReduce with RDMA [12] against the predictions of the performance model in [4]. This clearly illustrates the need for a new model for the enhanced design of MapReduce.

Table 1: Comparison using Sort
Our Approach

In the RDMA-based enhanced design of MapReduce, all of the new features are added inside the ReduceTask. Thus, to predict the performance of this design correctly, we model the performance of the ReduceTask from scratch. In the default MapReduce framework, the execution time of a single ReduceTask, t_RT, is calculated from the execution times of the different phases in the ReduceTask:

t_RT = t_shuffle + t_merge + t_reduce    (1)

For the RDMA-based design, on the other hand, t_RT is not as simple as in the default case. Because the three phases fully overlap, t_RT can be rewritten as:

t_RT = max(t_shuffle, t_merge) + α * t_reduce    (2)

Here, α represents the fraction of the total data that still resides in memory, yet to be reduced, once both the shuffle and merge phases have completed. Also, because of the architectural changes in the enhanced design, each of the parameters t_shuffle, t_merge, and t_reduce must be re-modeled to incorporate the new design enhancements.
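The difference between the two cost models can be sketched in a few lines of Python. This is an illustrative sketch only; the phase times and the value of α below are hypothetical inputs, not measurements from this work.

```python
def t_rt_default(t_shuffle, t_merge, t_reduce):
    """Eq. (1): the shuffle, merge, and reduce phases run sequentially."""
    return t_shuffle + t_merge + t_reduce

def t_rt_rdma(t_shuffle, t_merge, t_reduce, alpha):
    """Eq. (2): shuffle and merge fully overlap, so only the longer of
    the two contributes; alpha is the fraction of the data still to be
    reduced after both phases finish."""
    return max(t_shuffle, t_merge) + alpha * t_reduce

# Example with hypothetical phase times (seconds):
# default: 40 + 30 + 20 = 90; RDMA-based: max(40, 30) + 0.3 * 20 = 46
print(t_rt_default(40, 30, 20))
print(t_rt_rdma(40, 30, 20, 0.3))
```

The sketch makes the source of the speedup explicit: the overlap removes min(t_shuffle, t_merge) from the critical path, and only the residual fraction α of the reduce work remains exposed.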
Contribution

[Figure 1: Model validation in Stampede Cluster. Job execution time (sec) vs. cluster size, comparing experimental results with the model.]

We validate our model for enhanced MapReduce using TeraSort [15] on Stampede [14]. We vary the cluster size from 8 to 128, while increasing the data size exponentially from 4...