2011
DOI: 10.14778/1988776.1988778

Column-oriented storage techniques for MapReduce

Abstract: Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming A…

Cited by 107 publications (63 citation statements)
References 15 publications (33 reference statements)
“…It is also designed to run sequentially in order to improve disk access performance. To store the result of this aggregation process in a compressed big file, a new data structure was designed using the Avro serialization framework [8,10,11,16].…”
Section: Design
confidence: 99%
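The quoted design pivots row-oriented records into columns before serializing them into one compressed big file. The sketch below illustrates that idea only; it does not use Avro itself — `json` plus `gzip` stand in for Avro's binary object-container encoding, and the record fields are hypothetical.

```python
import gzip
import json

# Hypothetical records; in the cited work these would be many small
# log entries aggregated into one "big file".
records = [
    {"url": "a.com", "hits": 3},
    {"url": "b.com", "hits": 7},
    {"url": "c.com", "hits": 1},
]

# Pivot rows into columns so values of the same field sit together,
# which is what makes the subsequent compression effective.
columns = {field: [r[field] for r in records] for field in records[0]}

# Serialize and compress the columnar layout (Avro's binary encoding
# would replace json here).
payload = gzip.compress(json.dumps(columns).encode("utf-8"))

# Reading back: decompress, then reassemble rows from the columns.
restored = json.loads(gzip.decompress(payload).decode("utf-8"))
rows = [dict(zip(restored, vals)) for vals in zip(*restored.values())]
```

Because each column holds values of a single type, a real implementation would also apply type-specific encodings (e.g., run-length or delta encoding) before the general-purpose compressor.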
“…CIF [44] proposed a column-oriented, binary storage format for HDFS aiming to improve its performance. The idea is that each file is first horizontally partitioned into splits, and each split is stored in a subdirectory.…”
Section: Data Layouts
confidence: 99%
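The CIF layout described above — horizontal partitioning into splits, one subdirectory per split, one file per column — can be sketched on a local filesystem. This is an illustrative stand-in, not CIF's actual on-disk format: the function name, directory naming, and plain-text column files are all assumptions.

```python
import os
import tempfile

def write_cif(base, rows, split_size):
    """Hypothetical CIF-style writer: partition rows into splits,
    give each split its own subdirectory, and store each column of
    a split in a separate file."""
    for start in range(0, len(rows), split_size):
        split_dir = os.path.join(base, f"split-{start // split_size}")
        os.makedirs(split_dir)
        split_rows = rows[start:start + split_size]
        for col in split_rows[0]:
            # One file per column; a query touching only this column
            # can read just this file in each split.
            with open(os.path.join(split_dir, col), "w") as f:
                f.write("\n".join(str(r[col]) for r in split_rows))

rows = [{"id": i, "name": f"user{i}"} for i in range(4)]
base = tempfile.mkdtemp()
write_cif(base, rows, split_size=2)
```

Keeping all column files of a split in one subdirectory is what lets a MapReduce scheduler co-locate them on the same node, so reassembling rows does not require network I/O.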
“…Join techniques compared: Map-Reduce-Merge [29]; Map-Join-Reduce [58]; Afrati et al. [5,6] (hash-based, "share"-based); Repartition join [18] (hash-based); Broadcast join [18] (broadcast); Semi-join [18] (broadcast); Per-split semi-join [18].
Whether each approach changes Hadoop itself:
Hadoop++ [36]: no, based on using UDFs.
HAIL [37]: yes, changes the RecordReader and a few UDFs.
CoHadoop [41]: yes, extends HDFS and adds metadata to the NameNode.
Llama [74]: no, runs on top of Hadoop.
Cheetah [28]: no, runs on top of Hadoop.
RCFile [50]: no changes to Hadoop, implements certain interfaces.
CIF [44]: no changes to Hadoop core, leverages extensibility features.
Trojan layouts [59]: yes, introduces Trojan HDFS (among others).
MRShare [83]: yes, modifies map outputs with tags and writes to multiple output files on the reduce side.
ReStore [40]: yes, extends the JobControlCompiler of Pig.
Sharing scans [11]: independent of the system.
Silva et al. [95]: no, integrated into SCOPE.
Incoop [17]: yes, new file system, contraction phase, and memoization-aware scheduler.
Li et al. [71,72]: yes, modifies the internals of Hadoop by replacing key components.
Grover et al. [47]: yes, introduces dynamic job and Input Provider.
EARL [67]: yes, RecordReader and Reduce classes are modified, plus a simple extension to Hadoop to support dynamic input and efficient resampling.
Top-k queries [38]: yes, changes data placement and builds statistics.
RanKloud [24]: yes, integrates its execution engine into Hadoop and uses local B+Tree indexes.
HaLoop [22,23]: yes, use of caching and changes to the scheduler.
MapReduce online [30]: yes, communication between Map and Reduce, and to the JobTracker and TaskTracker.
NOVA [85]: no, runs on top of Pig and Hadoop.
Twister [39]: adopts an …”
Section: Join Type
confidence: 99%
“…Based on the collected bids, the scheduler allocates subqueries to the workers (5). Next, the master starts remote execution of the subqueries in parallel on the workers (6). Each worker requests (7) and obtains (8) from the master just-in-time replicas of data needed by the query.…”
Section: Architecture
confidence: 99%
“…Its performance, shown to be sub-optimal in the database context [16], has recently been boosted by adding features and developing optimization frameworks. Often, solutions are found in well-known techniques from the database world, such as indexing [12] and column-oriented storage [6].…”
Section: Related Work
confidence: 99%