Evaluating Hadoop for Data-Intensive Scientific Operations

Fadika, Zacharia; Govindaraju, Madhusudhan; Canon, R. S.; Ramakrishnan, Lavanya

doi:10.1109/cloud.2012.118

Cited by 30 publications

(32 citation statements)

References 20 publications


“…Additionally, the authors report significant performance penalties due to the virtualization layer. Other work related to our study was reported in [35], where the authors analyzed the streaming features of Hadoop [36] and reported the existence of an overhead due to streaming. We have made a similar observation regarding the scalability of stream processing.…”

Section: Related Work (mentioning)

confidence: 99%

Evaluating Streaming Strategies for Event Processing Across Infrastructure Clouds

Tudoran, Keahey, Riteau et al. 2014

2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing


Abstract: Infrastructure clouds revolutionized the way in which we approach resource procurement by providing an easy way to lease compute and storage resources on short notice, for a short amount of time, and on a pay-as-you-go basis. This new opportunity, however, introduces new performance trade-offs. Making the right choices in leveraging different types of storage available in the cloud is particularly important for applications that depend on managing large amounts of data within and across clouds. An increasing number of such applications conform to a pattern in which data processing relies on streaming the data to a compute platform where a set of similar operations is repeatedly applied to independent chunks of data. This pattern is evident in virtual observatories such as the Ocean Observatory Initiative, in cases when new data is evaluated against existing features in geospatial computations or when experimental data is processed as a series of time events. In this paper, we propose two strategies for efficiently implementing such streaming in the cloud and evaluate them in the context of an ATLAS application processing experimental data. Our results show that choosing the right cloud configuration can improve overall application performance by as much as three times.



“…To evaluate the performance and energy efficiency of Hadoop applications in different Hadoop deployment scenarios we use three micro-benchmarks: TeraGen, TeraSort, and Wikipedia data processing [15]. The former two benchmarks are among the most widely used standard Hadoop benchmarks.…”

Section: A Workloads (mentioning)

confidence: 99%

“…al [21] shows that a proper MapReduce implementation can achieve a performance close to parallel databases through experiments performed on Amazon EC2. Previous work [15] evaluated Hadoop for scientific applications and the tradeoffs of various hardware and file system configurations. Our work complements the aforementioned performance efforts by investigating the Hadoop performance with separated data and compute layers and specific data operations.…”

Section: Related Work (mentioning)

confidence: 99%

On the performance and energy efficiency of Hadoop deployment models

Feller, Ramakrishnan, Morin 2013

2013 IEEE International Conference on Big Data

Self Cite


Abstract: The exponential growth of scientific and business data has resulted in the evolution of the cloud computing and the MapReduce parallel programming model. Cloud computing emphasizes increased utilization and power savings through consolidation while MapReduce enables large scale data analysis. The Hadoop framework has recently evolved to the standard framework implementing the MapReduce model. In this paper, we evaluate Hadoop performance in both the traditional model of collocated data and compute services as well as consider the impact of separating out the services. The separation of data and compute services provides more flexibility in environments where data locality might not have a considerable impact such as virtualized environments and clusters with advanced networks. In this paper, we also conduct an energy efficiency evaluation of Hadoop on physical and virtual clusters in different configurations. Our extensive evaluation shows that: (1) performance on physical clusters is significantly better than on virtual clusters; (2) performance degradation due to separation of the services depends on the data to compute ratio; (3) application completion progress correlates with the power consumption and power consumption is heavily application specific.


“…al [23] shows that a proper MapReduce implementation can achieve a performance close to parallel databases through experiments performed on Amazon EC2. Previous work [16] evaluated Hadoop for scientific applications and the trade-offs of various hardware and file system configurations.…”

Section: Related Work (mentioning)

confidence: 99%

“…To evaluate the performance and energy efficiency of Hadoop applications in different Hadoop deployment scenarios we use three micro-benchmarks: TeraGen, TeraSort, and Wikipedia data processing [16]. The former two benchmarks are among the most widely used standard Hadoop benchmarks.…”

Section: Workloads (mentioning)

confidence: 99%

Performance and energy efficiency of big data applications in cloud environments: A Hadoop case study

Feller, Ramakrishnan, Morin 2015

Journal of Parallel and Distributed Computing

Self Cite


The exponential growth of scientific and business data has resulted in the evolution of the cloud computing environments and the MapReduce parallel programming model. The focus of cloud computing is increased utilization and power savings through consolidation while MapReduce enables large scale data analysis. Hadoop, an open source implementation of MapReduce has gained popularity in the last few years. In this paper, we evaluate Hadoop performance in both the traditional model of collocated data and compute services as well as consider the impact of separating out the services. The separation of data and compute services provides more flexibility in environments where data locality might not have a considerable impact such as virtualized environments and clusters with advanced networks. In this paper, we also conduct an energy efficiency evaluation of Hadoop on physical and virtual clusters in different configurations. Our extensive evaluation shows that: (1) coexisting virtual machines on servers decrease the disk throughput; (2) performance on physical clusters is significantly better than on virtual clusters; (3) performance degradation due to separation of the services depends on the data to compute ratio; (4) application completion progress correlates with the power consumption and power consumption is heavily application specific. Finally, we present a discussion on the implications of using cloud environments for big data analyses.

