Jens Dittrich scite author profile

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.

show abstract

An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory

Schuh

Chen

Dittrich

2016

View full text Add to dashboard Cite

Only aggressive elephants are fast elephants

et al. 2012

View full text Add to dashboard Cite

Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast.We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.

show abstract

Data redundancy and duplicate detection in spatial join processing

Dittrich

Seeger

View full text Add to dashboard Cite

The Partitioned Based Spatial-Merge Join (PBSM) of Patel and DeWitt and the Size Separation Spatial Join (S 3 J) of Koudas and Sevcik are considered to be among the most efficient methods for processing spatial (intersection) joins on two or more spatial relations. Both methods do not assume the presence of pre-existing spatial indices on the relations. In this paper, we propose several improvements of these join algorithms. In particular, we deal with the impact of data redundancy and duplicate detection on the performance of theses methods. For PBSM, we present a simple and inexpensive on-line method to detect duplicates in the response set. There is no need anymore for eliminating duplicates in a final sorting phase as it has been suggested originally. We also investigate in this paper the impact of different internal algorithms on the total runtime of PBSM. For S 3 J, we break with the original design goal and introduce controlled redundancy of data objects. Results of a large set of experiments with real datasets reveal that our suggested modifications of PBSM and S 3 J result in substantial performance improvements where PBSM is generally superior to S 3 J.

show abstract

Blurring the Lines between Blockchains and Database Systems

et al. 2019

View full text Add to dashboard Cite

Within the last few years, a countless number of blockchain systems have emerged on the market, each one claiming to revolutionize the way of distributed transaction processing in one way or the other. Many blockchain features, such as byzantine fault tolerance, are indeed valuable additions in modern environments. However, despite all the hype around the technology, many of the challenges that blockchain systems have to face are fundamental transaction management problems. These are largely shared with traditional database systems, which have been around for decades already. These similarities become especially visible for systems, that blur the lines between blockchain systems and classical database systems. A great example of this is Hyperledger Fabric, an open-source permissioned blockchain system under development by IBM. By implementing parallel transaction processing, Fabric's work ow is highly motivated by optimistic concurrency control mechanisms in classical database systems. This raises two questions: (1) Which conceptual similarities and di erences do actually exist between a system such as Fabric and a classical distributed database system? (2) Is it possible to improve on the performance of Fabric by transitioning technology from the database world to blockchains and thus blurring the lines between these two types of systems even further? To tackle these questions, we rst explore Fabric from the perspective of database research, where we observe weaknesses in the transaction pipeline. We then solve these issues by transitioning well-understood database concepts to Fabric, namely transaction reordering

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Jens Dittrich

Efficient big data processing in Hadoop MapReduce

An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory

Only aggressive elephants are fast elephants

Data redundancy and duplicate detection in spatial join processing

Blurring the Lines between Blockchains and Database Systems

Contact Info

Product

Resources

About