Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments

Ubarhande, Vrushali; Popescu, Daniela Elena; González–Vélez, Horacio

doi:10.1109/cisis.2015.37

Cited by 24 publications

(11 citation statements)

References 15 publications


“…In contrast, big data is processed by clustering and scans multiple nodes of clusters in the network [23] . This processing is based on the concept of parallelism to handle large medical data sets [24] . Freely available frameworks, such as Hadoop, MapReduce, Pig, Sqoop, Hive, and HBase Avro, all have ability to process the health related data sets for healthcare systems.…”

Section: Big Data Analytics Architecture For Health Informatics (mentioning)

confidence: 99%

“…In the first component is the requirement for big data sources for processing. In the second component clusters with a centralized big-data processing infrastructure are at the peak of high performance [24] . It has been observed that the tools mainly available for big-data analytics processing provide data security, scalability, and manageability with the help of the MapReduce paradigm.…”

Section: Big Data Analytics Architecture For Health Informatics (mentioning)

confidence: 99%

“…No data is actually stored on the name node. Files are stored as blocks in proper sequence and these blocks are equal in size [24] . The features of HDFS are its distributed nature and reliability.…”

Section: Hadoop's Tools and Techniques For Big Data (mentioning)

confidence: 99%


Big data analytics for healthcare industry: impact, applications, and tools

Kumar

Singh

2019

Big Data Min. Anal.

165


In recent years, huge amounts of structured, unstructured, and semi-structured data have been generated by various institutions around the world and, collectively, this heterogeneous data is referred to as big data. The health industry sector has been confronted by the need to manage the big data being produced by various sources, which are well known for producing high volumes of heterogeneous data. Various big-data analytics tools and techniques have been developed for handling these massive amounts of data, in the healthcare sector. In this paper, we discuss the impact of big data in healthcare, and various tools available in the Hadoop ecosystem for handling it. We also explore the conceptual architecture of big data analytics for healthcare which involves the data gathering history of different branches, the genome database, electronic health records, text/imagery, and clinical decisions support system.




“…Xie et al and Anjos et al explore the possibility of placing data blocks to minimize job latency. Data blocks are placed based on the computing ratio in other works, to minimize makespan, whereas Chen et al place data blocks to minimize network transfer time. Anjos et al considers the capacity of nodes to minimize the latency of a job.…”

Section: Literature Survey (mentioning)

confidence: 99%

Improving MapReduce scheduler for heterogeneous workloads in a heterogeneous environment

Jeyaraj

Ananthanarayana

Paul

2019

Concurrency and Computation


Summary: Big data is largely influencing business entities and research sectors to be more data‐driven. Hadoop MapReduce is one of the cost‐effective ways to process large scale datasets and offered as a service over the Internet. Even though cloud service providers promise an infinite amount of resources available on‐demand, it is inevitable that some of the hired virtual resources for MapReduce are left unutilized and makespan is limited due to various heterogeneities that exist while offering MapReduce as a service. As MapReduce v2 allows users to define the size of containers for the map and reduce tasks, jobs in a batch become heterogeneous and behave differently. Also, the different capacity of virtual machines in the MapReduce virtual cluster accommodate a varying number of map/reduce tasks. These factors highly affect resource utilization in the virtual cluster and the makespan for a batch of MapReduce jobs. Default MapReduce job schedulers do not consider these heterogeneities that exist in a cloud environment. Moreover, virtual machines in MapReduce virtual cluster process an equal number of blocks regardless of their capacity, which affects the makespan. Therefore, we devised a heuristic‐based MapReduce job scheduler that exploits virtual machine and MapReduce workload level heterogeneities to improve resource utilization and makespan. We proposed two methods to achieve this: (i) roulette wheel scheme based data block placement in heterogeneous virtual machines, and (ii) a constrained 2‐dimensional bin packing to place heterogeneous map/reduce tasks. We compared heuristic‐based MapReduce job scheduler against the classical fair scheduler in MapReduce v2. Experimental results showed that our proposed scheduler improved makespan and resource utilization by 45.6% and 47.9% over classical fair scheduler.
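The roulette wheel placement mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `place_blocks`, the node names, and the use of a single numeric "capacity" per virtual machine are all assumptions made for illustration. The sketch only shows the general idea of roulette wheel selection, in which each block goes to a node chosen with probability proportional to that node's capacity, so that larger virtual machines tend to receive more blocks.

```python
import random
from collections import Counter


def place_blocks(num_blocks, capacities, seed=None):
    """Roulette wheel placement sketch (hypothetical helper, not from the paper).

    capacities maps a node name to a non-negative relative capacity.
    Each block is assigned independently to a node picked with probability
    proportional to its capacity. Returns a Counter of blocks per node.
    """
    nodes = list(capacities)
    weights = [capacities[node] for node in nodes]
    if not nodes or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("need at least one node with positive capacity")

    rng = random.Random(seed)  # seeded so that a run can be reproduced
    chosen = rng.choices(nodes, weights=weights, k=num_blocks)
    return Counter(chosen)


if __name__ == "__main__":
    # Assumed example cluster: relative capacities are illustrative only.
    cluster = {"vm-small": 1, "vm-medium": 2, "vm-large": 4}
    for node, count in sorted(place_blocks(700, cluster, seed=0).items()):
        print(node, count)
```

Because the selection is random, the block counts only approximate the capacity ratio; a deterministic proportional split would follow the ratio exactly, at the cost of no longer being a roulette wheel scheme.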


“…Become a Big Data era these days, many tools that analyzing the massive data efficiently such as R or Hadoop are released [4][5][6]. Especially Hadoop has strong point that it's possible to distributed processing the massive data in low cost, there's a drift towards research about Hadoop or System using Hadoop [7][8][9]. Hadoop consists of two parts, HDFS (Hadoop Distributed File System) and MapReduce framework.…”

Section: Introduction (mentioning)

confidence: 99%