2011
DOI: 10.14778/1988776.1988778

Column-oriented storage techniques for MapReduce

Abstract: Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming A…

Cited by 107 publications (63 citation statements)
References 15 publications (33 reference statements)
“…It is also designed to run sequentially in order to improve disk access performance. To store the result of this aggregation process in a compressed big file, a new data structure was designed using the Avro serialization framework [8,10,11,16].…”
Section: Design
confidence: 99%
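The quoted design pivots row-oriented records into columns before serializing them into one compressed big file. The sketch below illustrates that idea only; it does not use Avro itself — `json` plus `gzip` stand in for Avro's binary object-container encoding, and the record fields are hypothetical.

```python
import gzip
import json

# Hypothetical records; in the cited work these would be many small
# log entries aggregated into one "big file".
records = [
    {"url": "a.com", "hits": 3},
    {"url": "b.com", "hits": 7},
    {"url": "c.com", "hits": 1},
]

# Pivot rows into columns so values of the same field sit together,
# which is what makes the subsequent compression effective.
columns = {field: [r[field] for r in records] for field in records[0]}

# Serialize and compress the columnar layout (Avro's binary encoding
# would replace json here).
payload = gzip.compress(json.dumps(columns).encode("utf-8"))

# Reading back: decompress, then reassemble rows from the columns.
restored = json.loads(gzip.decompress(payload).decode("utf-8"))
rows = [dict(zip(restored, vals)) for vals in zip(*restored.values())]
```

Because each column holds values of a single type, a real implementation would also apply type-specific encodings (e.g., run-length or delta encoding) before the general-purpose compressor.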
“…CIF [44] proposed a column-oriented, binary storage format for HDFS aiming to improve its performance. The idea is that each file is first horizontally partitioned into splits, and each split is stored in a subdirectory.…”
Section: Data Layouts
confidence: 99%
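The CIF layout described above — horizontal partitioning into splits, one subdirectory per split, one file per column — can be sketched on a local filesystem. This is an illustrative stand-in, not CIF's actual on-disk format: the function name, directory naming, and plain-text column files are all assumptions.

```python
import os
import tempfile

def write_cif(base, rows, split_size):
    """Hypothetical CIF-style writer: partition rows into splits,
    give each split its own subdirectory, and store each column of
    a split in a separate file."""
    for start in range(0, len(rows), split_size):
        split_dir = os.path.join(base, f"split-{start // split_size}")
        os.makedirs(split_dir)
        split_rows = rows[start:start + split_size]
        for col in split_rows[0]:
            # One file per column; a query touching only this column
            # can read just this file in each split.
            with open(os.path.join(split_dir, col), "w") as f:
                f.write("\n".join(str(r[col]) for r in split_rows))

rows = [{"id": i, "name": f"user{i}"} for i in range(4)]
base = tempfile.mkdtemp()
write_cif(base, rows, split_size=2)
```

Keeping all column files of a split in one subdirectory is what lets a MapReduce scheduler co-locate them on the same node, so reassembling rows does not require network I/O.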
“…Join techniques compared: Map-Reduce-Merge [29]; Map-Join-Reduce [58]; Afrati et al. [5,6] (hash-based, "share"-based); Repartition join [18] (hash-based); Broadcast join [18] (broadcast); Semi-join [18] (broadcast); Per-split semi-join [18].
Whether each approach changes Hadoop itself:
Hadoop++ [36]: no, based on using UDFs.
HAIL [37]: yes, changes the RecordReader and a few UDFs.
CoHadoop [41]: yes, extends HDFS and adds metadata to the NameNode.
Llama [74]: no, runs on top of Hadoop.
Cheetah [28]: no, runs on top of Hadoop.
RCFile [50]: no changes to Hadoop, implements certain interfaces.
CIF [44]: no changes to Hadoop core, leverages extensibility features.
Trojan layouts [59]: yes, introduces Trojan HDFS (among others).
MRShare [83]: yes, modifies map outputs with tags and writes to multiple output files on the reduce side.
ReStore [40]: yes, extends the JobControlCompiler of Pig.
Sharing scans [11]: independent of the system.
Silva et al. [95]: no, integrated into SCOPE.
Incoop [17]: yes, new file system, contraction phase, and memoization-aware scheduler.
Li et al. [71,72]: yes, modifies the internals of Hadoop by replacing key components.
Grover et al. [47]: yes, introduces dynamic job and Input Provider.
EARL [67]: yes, RecordReader and Reduce classes are modified, plus a simple extension to Hadoop to support dynamic input and efficient resampling.
Top-k queries [38]: yes, changes data placement and builds statistics.
RanKloud [24]: yes, integrates its execution engine into Hadoop and uses local B+Tree indexes.
HaLoop [22,23]: yes, use of caching and changes to the scheduler.
MapReduce online [30]: yes, communication between Map and Reduce, and to the JobTracker and TaskTracker.
NOVA [85]: no, runs on top of Pig and Hadoop.
Twister [39]: adopts an …”
Section: Join Type
confidence: 99%
“…Based on the collected bids, the scheduler allocates subqueries to the workers (5). Next, the master starts remote execution of the subqueries in parallel on the workers (6). Each worker requests (7) and obtains (8) from the master just-in-time replicas of data needed by the query.…”
Section: Architecture
confidence: 99%
“…Its performance, shown to be sub-optimal in the database context [16], has recently been boosted by adding features and developing optimization frameworks. Often, solutions are found in well-known techniques from the database world, such as indexing [12] and column-oriented storage [6].…”
Section: Related Work
confidence: 99%