Job completion in MapReduce framework depends upon the slowest running reduce task. Inordinate time gap among the completion points of reduce tasks delays a job significantly. Synchronization in reduce task completion not only completes a job faster but also increases resource utilization. Completion time of reduce tasks depends mainly upon the data load given to each of them. Partitioning of keys plays a major role in load balancing input data to the reduce tasks. Range partitioning using sampling technique has been proposed in the literature over hash based partitioning to provide better key distribution. But data sampling is also susceptible to data/partitioning skew since only a small portion of the actual data is scanned. It is very difficult to design a perfect load balancing partitioner without having prior information about the entire data. Chisel, allows data skew to occur during partitioning but handles it dynamically by creating multiple partial reducers and one global reducer for the skewed partition. But this technique suffers from the global reducer becoming a bottleneck when the amount of reducer output is huge. We propose an efficient range partitioning technique which eliminates the use of global reducer and yields better performance. We also discuss various heuristics to implement range partitioning. Chisel++ achieved 2 times improvement in terms of job completion time over Chisel in our experiments.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.