The Hadoop distributed filesystem: Balancing portability and performance

Shafer, Jeffrey; Rixner, Scott; Cox, Alan L.

doi:10.1109/ispass.2010.5452045

Cited by 232 publications

(108 citation statements)

References 14 publications

Mentioning: 108

“…The software framework is written in Java for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware [37]. Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.…”

Section: Scientific Programming Tools For Social Media Analysis (mentioning)

confidence: 99%

Social Media Analytics and Intelligence

Jagdale

2018

IJCA

Recent advances in the internet have helped social media to influence many areas of business, marketing, weather forecasting, communication etc. Social media Analytics has become a vital part of the data analytics as the vast population is involved in social media and immeasurable insights can be taken from their activities in the social media websites. In this paper, we summarise different techniques used to analyze social media activities like tweets, blogs, etc., and to present the pros and cons of each.

“…HDFS plays a critical role in the Hadoop Ecosystem [13]. In this section, we focus on its runtime features.…”

Section: Hadoop Distributed File System (HDFS) (mentioning)

confidence: 99%

Adaptable I/O System based I/O Reduction for Improving the Performance of HDFS

Park, Kim, Koo et al. 2016

JSTS: Journal of Semiconductor Technology and Science

Abstract: In this paper, we propose a new HDFS-AIO framework to enhance HDFS with Adaptive I/O System (ADIOS), which supports many different I/O methods and enables applications to select optimal I/O routines for a particular platform without source-code modification and re-compilation. First, we customize ADIOS into a chunk-based storage system so its API semantics can fit the requirement of HDFS easily; then, we utilize Java Native Interface (JNI) to bridge HDFS and the tailored ADIOS. We use different I/O patterns to compare HDFS-AIO and the original HDFS, and the experimental results show the design feasibility and benefits. We also examine the performance of HDFS-AIO using various I/O techniques. There have been many studies that use ADIOS; however, our research is expected to help in expanding the function of HDFS.

“…For example, data center networks are often oversubscribed [8] and the disk throughput obtained by applications can fall well short of the disk hardware capabilities [22], [21]. A number of proposals improve I/O performance and could also decrease the absolute replication costs.…”

Section: Why Replication Is Problematic (mentioning)

confidence: 99%

“…Today's clusters are especially inefficient at handling large transfers due to economical constraints and architectural bottlenecks (e.g. oversubscribed networks [8], poor disk throughput [22]). For instance, in our evaluation we show that in the absence of failures, an I/O-intensive multi-job computation can double its running time when the replication factor is increased from 1 to 3.…”

Section: Introduction (mentioning)

confidence: 99%

RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics

Dinu

2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Abstract: Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.
