Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast.We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
Several research works have focused on supporting index access in MapReduce systems. These works have allowed users to significantly speed up selective MapReduce jobs by orders of magnitude. However, all these proposals require users to create indexes upfront, which might be a difficult task in certain applications (such as in scientific and social applications) where workloads are evolving or hard to predict. To overcome this problem, we propose LIAH (Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for indexing at minimal costs for MapReduce systems. The main idea of LIAH is to automatically and incrementally adapt to users' workloads by creating clustered indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides distributing indexing efforts over multiple computing nodes, LIAH also parallelises indexing with both map tasks computation and disk I/O. All this without any additional data copy in main memory and with minimal synchronisation. The beauty of LIAH is that it piggybacks index creation on map tasks, which read relevant data from disk to main memory anyways. Hence, LIAH does not introduce any additional read I/O-costs and exploit free CPU cycles. As a result and in contrast to existing adaptive indexing works, LIAH has a very low (or invisible) indexing overhead, usually for the very first job. Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks are indexed. Especially, LIAH can trade early job runtime improvements with fast complete index convergence. We compare LIAH with HAIL, a state-of-the-art indexing technique, as well as with standard Hadoop with respect to indexing overhead and workload performance. In terms of indexing overhead, LIAH can completely index a dataset as a byproduct of only four MapReduce jobs while incurring a low overhead of 11% over HAIL for the very first MapReduce job only. In terms of workload performance, our results show that LIAH outperforms Hadoop by up to a factor of 52 and HAIL by up to a factor of 24.
The Hadoop Distributed Filesystem has become the de-facto standard for storing large datasets in data management systems such as Hadoop MapReduce, Hive, and Stratosphere. Though HDFS was originally designed to support scan-oriented operations, recently several techniques for HDFS have been developed to allow for efficient indexing. One of these indexing techniques is aggressive indexing, i.e. HDFS replicas are immediately indexed at upload time before touching any disk -creating multiple clustered indexes almost for free on the way. A second technique is adaptive indexing, i.e. HDFS blocks are only indexed on demand as a side effect of query processing. Though these techniques provide impressive speed-ups in terms of query processing, they totally ignored the costs involved with storing a large number of replicas of a particular dataset. The HDFS-variants of adaptive indexing were already designed to leverage the natural redundancy that comes with HDFS, typically storing a dataset three times anyway. However, it is questionable whether storing an unlimited number of replicas for a dataset is a practical solution. Therefore, this paper is the first to analyze adaptive indexing under a space constraint, i.e. we assume that indexes are adaptively created and deleted. We coin this problem the Adaptive Index Replacement problem. We present a new algorithm to solve the online AIR problem called LeastExpectedBenefit-K and compare it with several existing state-of-the-art online Index Selection algorithms. We present a comprehensive study evaluating ten different algorithms. Our results show that our algorithm LEB-2 is efficient and robust and a good choice in practice.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.