2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS) 2010
DOI: 10.1109/ispass.2010.5452045

The Hadoop distributed filesystem: Balancing portability and performance

Abstract: Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem, HDFS, is written in Java and designed for portability across heterogeneous hardware and software platforms. This paper analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usag…

Cited by 232 publications (108 citation statements)
References 14 publications
“…The software framework is written in Java for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware [37]. Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.…”
Section: Scientific Programming Tools For Social Media Analysis (mentioning)
confidence: 99%
“…HDFS plays a critical role in the Hadoop Ecosystem [13]. In this section, we focus on its runtime features.…”
Section: Hadoop Distributed File System (HDFS) (mentioning)
confidence: 99%
“…For example, data center networks are often oversubscribed [8] and the disk throughput obtained by applications can fall well short of the disk hardware capabilities [22], [21]. A number of proposals improve I/O performance and could also decrease the absolute replication costs.…”
Section: Why Replication Is Problematic (mentioning)
confidence: 99%
“…Today's clusters are especially inefficient at handling large transfers due to economical constraints and architectural bottlenecks (e.g. oversubscribed networks [8], poor disk throughput [22]). For instance, in our evaluation we show that in the absence of failures, an I/O-intensive multi-job computation can double its running time when the replication factor is increased from 1 to 3.…”
Section: Introduction (mentioning)
confidence: 99%