Aggrandizing Hadoop in terms of node Heterogeneity & Data Locality

Sujitha, S.; Jaganathan, Suresh

doi:10.1109/icsss.2013.6623017

Cited by 5 publications

(6 citation statements)

References 4 publications

“…The input to the task must be present on a node where task is supposed to be executed otherwise needs transferring input data which ultimately increased execution time. Sujitha et al [218] proposed a methodology to address the issues of heterogeneity and data locality in Hadoop.…”

Section: Locality Aware Data Placement In Heterogeneous Environment (mentioning)

confidence: 99%

Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and A Future System Architecture

Usman, Mehmood, Katib et al. 2022

Preprint

Big data has revolutionised science and technology leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence and methods. Traditionally HPC and big data had focused on different problem domains and had grown into two different ecosystems. Efforts have been underway for the last few years on bringing the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design as moving even a byte costs heavily both in time and energy with an increase in the size of the system. Performance in terms of time and energy are the most important factors for users, particularly energy, due to it being the major hurdle in high performance system design and the increasing focus on green energy systems due to environmental sustainability. Data locality is a broad term that encapsulates different aspects including bringing computations to data, minimizing data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the cutting-edge on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.

“…To handle big data, a single machine is usually inadequate, so a cluster, which refers to a group of coordinated machines, needs to be set up to distribute the workload to different machines [73].…”

Section: Figure 1 Framework Architecture (mentioning)

confidence: 99%

“…The technologies such as Amazon EMR, which is a big data platform to quickly and effectively process huge amounts of data, can be employed to set up a cluster, and Apache Spark can be used to process data in the cluster environment. A cluster (see Figure 3 [75]) typically includes one master node and several worker nodes -a node is an individual machine in the cluster [73]. It can be managed by using tools such as Apache YARN, which is formed by two types of daemons: a ResourceManager running on the master node and NodeManager(s) running on the worker node(s) [76].…”

Section: Figure 2 Data Analysis Process [74] (mentioning)

confidence: 99%

Harvesting Wisdom on Social Media for Business Decision Making

Yu, Taşkın, Pauleen et al. 2022

Proceedings of the Annual Hawaii International Conference on System Sciences

The proliferation of social media provides significant opportunities for organizations to obtain wisdom of the crowds (WOC)-type data for decision making. However, critical challenges associated with collecting such data exist. For example, the openness of social media tends to increase the possibility of social influence, which may diminish group diversity, one of the conditions of WOC. In this research-in-progress paper, a new social media data analytics framework is proposed. It is equipped with well-designed mechanisms (e.g., using different discussion processes to overcome social influence issues and boost social learning) to generate data and employs state-of-the-art big data technologies, e.g., Amazon EMR, for data processing and storage. Design science research methodology is used to develop the framework. This paper contributes to the WOC and social media adoption literature by providing a practical approach for organizations to effectively generate WOC-type data from social media to support their decision making.

“…MapReduce is processing large-scale data via the distributed, parallel programming approach [2, 3]. However, the map and reduce processes are not optimized for heterogeneous environment [4]. Various approaches have been proposed to improve MapReduce performance in heterogeneous environment [1,4,5,6].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, the map and reduce processes are not optimized for heterogeneous environment [4]. Various approaches have been proposed to improve MapReduce performance in heterogeneous environment [1,4,5,6]. [1] proposes a data placement algorithm, namely Dynamic Data Placement (DDP), to resolve the unbalanced node workload problem in heterogeneous environment.…”

Section: Introduction (mentioning)

confidence: 99%

Enhanced Dynamic Data Placement and Virtual Machine Creation for MapReduce

2017

SAHSS-2017, LEBCSR-17, LERIS-2017, Jan. 31-Feb. 1, 2017 Bali (Indonesia)

In this paper, we propose a novel mechanism, namely Enhanced Dynamic Data Placement (EDDP). EDDP has two components: data partitioning and virtual machine (VM) optimization. The first component is adapted from [Lee et al, 2014], whereby the amount of data placed on each computing node must be proportional to its computation capability. In the second component, the configurations of the virtual machines created to handle the incoming jobs are optimized based on benchmarking. Experimental results show that EDDP shortens job completion time.
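
The abstract above says only that each node's share of the data should be proportional to its computation capability. The following is a minimal sketch of that idea, not the EDDP or DDP algorithm itself: the function name `proportional_block_counts`, the node names, and the capability scores are all hypothetical, and the largest-remainder rounding is an assumption introduced here so that the block counts sum exactly to the total.

```python
def proportional_block_counts(total_blocks, capabilities):
    """Split total_blocks across nodes in proportion to capability scores.

    capabilities maps a (hypothetical) node name to a positive score,
    e.g. a relative throughput obtained from benchmarking.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not capabilities or any(c <= 0 for c in capabilities.values()):
        raise ValueError("capabilities must be non-empty and positive")

    total_capability = sum(capabilities.values())

    # Ideal (fractional) share of blocks for each node.
    shares = {
        node: total_blocks * score / total_capability
        for node, score in capabilities.items()
    }

    # Start from the integer part of each share.
    counts = {node: int(share) for node, share in shares.items()}

    # Hand the remaining blocks to the nodes with the largest fractional
    # remainders (largest-remainder rounding; an assumption of this sketch).
    leftover = total_blocks - sum(counts.values())
    by_remainder = sorted(
        shares, key=lambda node: shares[node] - counts[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        counts[node] += 1

    return counts


if __name__ == "__main__":
    # Hypothetical cluster: "fast" is three times as capable as "slow".
    print(proportional_block_counts(12, {"fast": 3.0, "medium": 2.0, "slow": 1.0}))
```

In a real Hadoop deployment the capability scores would have to come from measurement, and block placement would also be constrained by replication and rack policy, neither of which this sketch models.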
