With the explosive growth of data volumes, data centers play a critical role in storing and processing massive amounts of data. A traditional single data center can no longer keep up with such rapidly growing data, and recent research has therefore extended data-processing tasks to geographically distributed data centers. However, because task placement and data transfer must be considered jointly, it is difficult to design a scheduling approach that minimizes makespan under constraints such as task dependencies, processing capability, and network capacity. To address this, we propose JHTD, an efficient joint scheduling framework based on hypergraphs for task placement and data transfer across geographically distributed data centers. JHTD consists of two stages. In the first stage, motivated by the strength of hypergraphs in modeling complex relationships, we build a hypergraph-based model that captures the relationships among tasks, data files, and data centers, and develop a hypergraph partitioning method for task placement. In the second stage, we devise a task reallocation scheme driven by each task-to-data dependency, together with a data-dependency-aware transfer scheme designed to minimize makespan. Finally, we conduct extensive simulation experiments based on the real-world China-VO project. The results demonstrate that JHTD effectively optimizes task placement and data transfer across geographically distributed data centers: compared with three state-of-the-art algorithms, JHTD reduces makespan by up to 20.6%.
In addition, the effects of data transfer volume and load balancing are examined to further demonstrate the effectiveness of JHTD.

INDEX TERMS Big data processing, geographically distributed data centers, joint scheduling framework, hypergraph, task placement, data transfer.

I. INTRODUCTION

With the advent of the Big Data era, the rate of data generation is increasing dramatically. For example, Internet giants such as Google and Facebook crunch more than 10 PB of data a day [1]. It is therefore essential to improve the efficiency of data processing in the face of such huge volumes of data. MapReduce [2] and Spark [3] have been widely adopted to process large amounts of data. These frameworks typically run data-analytic jobs characterized by data-dependency awareness: each job can be divided into a set of dependent tasks, and the execution of a task requires not only the output of its parent tasks but also the input data. Normally, the data and