This paper quantitatively measures the impact of different data center networking topologies on the performance and energy efficiency of shuffling operations in MapReduce. Mixed Integer Linear Programming (MILP) models are used to optimize shuffling in several data center topologies with electronic, hybrid, and all-optical switching while maximizing throughput and reducing power consumption. The results indicate that the networking topology has a significant impact on the performance of MapReduce. They also indicate that, at comparable performance, optical-based data centers can achieve an average 54% reduction in energy consumption compared to electronic switching data centers.

Keywords: Data Center Networking (DCN), MapReduce, energy efficiency, completion time.
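To illustrate the kind of optimization such MILP models perform, the following is a minimal sketch, not the paper's actual model: all demands, capacities, and per-bit power figures are hypothetical, and the formulation (routing shuffle flows over an electronic or an optical fabric to minimize switching energy while delivering all demands) is deliberately simplified. It uses the open-source PuLP library.

```python
# Toy MILP: route map->reduce shuffle flows over two switching fabrics
# (electronic vs. optical) to minimize switching energy, subject to
# delivering every shuffle demand within each fabric's capacity.
# All numbers below are illustrative assumptions, not measured values.
from pulp import (LpProblem, LpVariable, LpMinimize, lpSum,
                  LpStatus, PULP_CBC_CMD)

maps = ["m1", "m2"]                  # servers hosting map slots
reduces = ["r1", "r2"]               # servers hosting reduce slots
demand = {("m1", "r1"): 4, ("m1", "r2"): 2,   # shuffle volumes (Gb), assumed
          ("m2", "r1"): 3, ("m2", "r2"): 5}
fabrics = ["electronic", "optical"]
energy_per_gb = {"electronic": 1.0, "optical": 0.46}  # relative, assumed
capacity = {"electronic": 10, "optical": 10}          # Gb per fabric, assumed

prob = LpProblem("shuffle_energy", LpMinimize)
# f[m, r, p]: amount of flow (m, r) carried over fabric p
f = {(m, r, p): LpVariable(f"f_{m}_{r}_{p}", lowBound=0)
     for m in maps for r in reduces for p in fabrics}

# Objective: total switching energy over all routed traffic
prob += lpSum(energy_per_gb[p] * f[m, r, p]
              for m in maps for r in reduces for p in fabrics)
# Every shuffle demand must be fully delivered
for (m, r), d in demand.items():
    prob += lpSum(f[m, r, p] for p in fabrics) == d
# Aggregate capacity constraint per fabric
for p in fabrics:
    prob += lpSum(f[m, r, p] for m in maps for r in reduces) <= capacity[p]

prob.solve(PULP_CBC_CMD(msg=False))
print(LpStatus[prob.status], prob.objective.value())
```

With these assumed numbers the solver fills the cheaper optical fabric to its 10 Gb capacity and routes the remaining 4 Gb electronically, giving a total energy of 10 x 0.46 + 4 x 1.0 = 8.6 units.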
INTRODUCTION

The MapReduce programming model and its widely used platform, Hadoop, are enabling several cost-effective cloud-based big data services [1]. These services typically require extensive all-to-all communication between hosting servers, leading to increased congestion and power consumption in data centers. Moreover, they result in East-West traffic dominating North-South traffic. This new traffic trend has become a central consideration in the design of state-of-the-art production data centers [2]. These challenges increasingly motivate the consideration of all-optical networking in future data centers to cope with the growing demands of big data applications while improving data center performance and decreasing power consumption [3].

The processing in MapReduce is composed of map, shuffle, and reduce phases. The input data is stored on the local disks of several servers and is globally managed by a distributed file system (DFS) [1]. Processing starts by assigning map slots according to the number of input data chunks and the available computing resources, and reduce slots according to the user's configuration. If there are more chunks than map slots, the map phase runs in several waves according to their scheduling [4]. Each map slot processes its assigned chunks, preferably available locally, and generates intermediate results in the form of <key, value> pairs. The intermediate results are shuffled to reduce slots according to their keys, where each reduce slot is assigned to process a unique set of keys [1]. Finally, each reduce slot sorts its inputs, calculates the final results, and saves them in the DFS.

Several optimization studies have been carried out by both academia and industry to enhance the performance and energy efficiency of big data applications (e.g. [2], [4]-[21]). The performance of big data applications and frameworks such as MapReduce is associated with a wide range of factors and parameters such as the cluster specifications (e.g. 
CPU, memory, networking, and disk I/O resources [9]), the framework and version used, and the selected configurations and mechanisms for data and job placement and task scheduling [4]-[8]. Moreover, as the deployments of big data applications are evolvin...