Virtualization as a platform for resource-intensive applications, such as MapReduce (MR), has been the subject of many studies in recent years, as it has brought benefits such as better manageability, overall resource utilization, security and scalability. Nevertheless, because of its performance overheads, virtualization has traditionally been avoided in computing environments where performance is a critical factor. In this context, container-based virtualization can be considered a lightweight alternative to traditional hypervisor-based virtualization systems. In fact, there is a trend towards using containers in MR clusters in order to provide resource sharing and performance isolation (e.g., Mesos and YARN). However, there are still no studies evaluating the performance overhead of the current container-based systems and their ability to provide performance isolation when running MR applications. In this work, we conducted experiments to compare and contrast the current container-based systems (Linux VServer, OpenVZ and Linux Containers (LXC)) in terms of performance and manageability when running on MR clusters. Our results showed that although all container-based systems reach near-native performance for MapReduce workloads, LXC offers the best balance between performance and management capabilities (especially with regard to performance isolation).
The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce framework, as implemented in Hadoop, is one of the most popular frameworks for Big Data analysis. To handle ever-increasing data sizes, Hadoop is designed to scale, allowing dedicated, seemingly unbounded numbers of servers to participate in the analytics process. The response time of an analytics request is an important factor for time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MapReduce contributes significantly to the overall response time; the problem is further aggravated if communication patterns are heavily skewed, as is not uncommon in many MapReduce workloads. In this paper we present a system that reduces the impact of skew by transparently predicting data communication volume at runtime and mapping the many end-to-end flows among the various processes to the underlying network, using emerging software-defined networking technologies to avoid hotspots in the network. Depending on the network oversubscription ratio, we demonstrate a reduction in job completion time of between 3% and 46% for popular MapReduce benchmarks like Sort and Nutch.
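The idea of mapping flows with predicted volumes onto network paths so that no single path becomes a hotspot can be illustrated with a minimal sketch. This is not the system described in the abstract: the greedy largest-first heuristic, the function name `place_flows`, the host names and the path names are all assumptions chosen for illustration, and a real deployment would install the resulting assignment through a software-defined networking controller rather than return a dictionary.

```python
from typing import Dict, List, Tuple

Flow = Tuple[str, str]  # (source host, destination host); hypothetical naming


def place_flows(
    predicted_bytes: Dict[Flow, int], paths: List[str]
) -> Tuple[Dict[Flow, str], Dict[str, int]]:
    """Greedily assign each flow to the currently least-loaded path.

    Illustrative heuristic only (an assumption, not the paper's method):
    flows are placed in order of decreasing predicted volume so that the
    heaviest, most skewed flows are spread out first.
    """
    load = {path: 0 for path in paths}
    assignment: Dict[Flow, str] = {}
    by_volume = sorted(predicted_bytes.items(), key=lambda kv: kv[1], reverse=True)
    for flow, volume in by_volume:
        # Pick the path carrying the least predicted traffic so far.
        best = min(paths, key=lambda path: load[path])
        assignment[flow] = best
        load[best] += volume
    return assignment, load


if __name__ == "__main__":
    # Hypothetical skewed shuffle: one mapper-to-reducer flow dominates.
    predicted = {
        ("m1", "r1"): 900,
        ("m2", "r1"): 100,
        ("m3", "r2"): 100,
        ("m4", "r2"): 100,
    }
    assignment, load = place_flows(predicted, ["spine-a", "spine-b"])
    for flow, path in assignment.items():
        print(flow, "->", path)
    print(load)
```

In this example the dominant flow gets one path to itself and the three small flows share the other, which is the kind of hotspot avoidance the abstract refers to, under the simplifying assumption that every flow can use every path.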