In many application fields such as social networks, e-commerce and content delivery networks there is a constant production of big amounts of data in geographically distributed sites that need to be timely elaborated. Distributed computing frameworks such as Hadoop (based on the MapReduce paradigm) have been used to process big data by exploiting the computing power of many cluster nodes interconnected through high speed links. Unfortunately, Hadoop was proved to perform very poorly in the just mentioned scenario. We designed and developed a Hadoop framework that is capable of scheduling and distributing hadoop tasks among geographically distant sites in a way that optimizes the overall job performance. We propose a hierarchical approach where a top-level entity, by exploiting the information concerning the data location, is capable of producing a smart schedule of low-level, independent MapReduce sub-jobs. A software prototype of the framework was developed. Tests run on the prototype showed that the job scheduler makes good forecasts of the expected job's execution time.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.