Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analysis simultaneously by deploying Data-Intensive Workflows (DIWs). The DIWs of different users share many common tasks (i.e., 50-80%), whose output can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems using a fixed storage format. However, a fixed choice is not optimal for every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a large impact on execution, depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. We first present a generic cost-based framework that selects the best format by considering the three main layouts. Then, we use our framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. Our solution gives on average 1.33x
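As a rough illustration of the idea of cost-based layout selection, the following minimal sketch compares the three layouts for a given access pattern. It is not the cost model from the paper: the `Access` class, the `WRITE_FACTOR` values, and the read-cost formulas are all simplifying assumptions made here for illustration.

```python
from dataclasses import dataclass

# Assumed relative write overheads per layout (illustrative values only,
# not measured and not taken from the paper).
WRITE_FACTOR = {"horizontal": 1.0, "vertical": 1.5, "hybrid": 2.0}


@dataclass
class Access:
    """One subsequent operation over the materialized data (hypothetical)."""
    columns_read: int      # number of columns the operation touches
    selectivity: float     # fraction of rows it needs, between 0 and 1
    repetitions: int = 1   # how many times the operation is expected to run


def layout_cost(layout, workload, total_bytes, num_columns):
    """Estimate bytes written once plus bytes read by the whole workload."""
    cost = WRITE_FACTOR[layout] * total_bytes
    for access in workload:
        col_fraction = access.columns_read / num_columns
        if layout == "horizontal":
            read = total_bytes                      # full scan of every row
        elif layout == "vertical":
            read = total_bytes * col_fraction       # only the needed columns
        else:
            # Hybrid: needed columns, and (idealized) only matching partitions.
            read = total_bytes * col_fraction * access.selectivity
        cost += access.repetitions * read
    return cost


def choose_layout(workload, total_bytes, num_columns):
    """Return the layout with the lowest estimated cost."""
    return min(
        WRITE_FACTOR,
        key=lambda layout: layout_cost(layout, workload, total_bytes, num_columns),
    )
```

Under these assumptions, a single full scan of all columns favors the horizontal layout, while a narrow, selective operation repeated many times favors the hybrid one, which mirrors the point that no fixed choice suits every access pattern.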
In various fields, scientific article publication is a measure of productivity, and on many occasions it is used as a critical factor for evaluating researchers. Therefore, a lot of time is dedicated to writing articles that are then submitted for publication in journals. Nevertheless, the publication process in general, and the review process in particular, tend to be rather slow. This is the case, for instance, for Computer Science (CS) journals. Moreover, the process typically lacks transparency: information about the duration of the review process is at best provided in an aggregated manner, if made available at all. In this paper, we develop a framework as a step towards providing more reliable data on review duration. Based on this framework, we implement a tool, Journal Response Time (JRT), that automatically extracts the review process data and helps researchers find the average response times of journals, which can be used to study the duration of CS journals' peer review process. The information is extracted as metadata from the published articles, when available. This study reveals that the response times publicly provided by publishers differ from the actual values obtained by JRT (e.g., for ten selected journals the average duration reported by publishers deviates by more than 500% from the actual average value calculated from the data inside the articles). We suspect this is because, when calculating the aggregated values, publishers also consider the review time of rejected articles (including quick desk rejections that do not require reviewers).
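The core calculation described above, averaging review durations taken from article metadata, can be sketched as follows. This is not the JRT implementation: the function names are hypothetical, and it assumes each article's metadata has already been reduced to a pair of received and accepted dates, with `None` where a date is unavailable.

```python
from datetime import date
from statistics import mean


def review_days(received: date, accepted: date) -> int:
    """Number of days between submission and acceptance."""
    return (accepted - received).days


def average_response_days(articles):
    """Average review duration over articles that expose both dates.

    `articles` is an iterable of (received, accepted) pairs; pairs with a
    missing date are skipped, since the metadata is not always available.
    Returns None when no article has both dates.
    """
    durations = [
        review_days(received, accepted)
        for received, accepted in articles
        if received is not None and accepted is not None
    ]
    return mean(durations) if durations else None
```

Note that such an average covers accepted, published articles only, which is consistent with the suspected reason why publisher-reported figures differ.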
Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) is typically dumped into a distributed engine, where it is generally stored in a hybrid layout. Hybrid layouts divide data into horizontal partitions, and inside each partition data are stored vertically. They keep statistics for each horizontal partition and also support encoding (e.g., dictionary) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (e.g., selection, projection, aggregation) makes hybrid layouts the best choice for most operations. The horizontal partition and dictionary sizes of hybrid layouts are configurable and can directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which, based on a cost model and given the workload and the characteristics of the data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, an open source implementation of hybrid layouts for the Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of all the potential performance improvement, and a 1.2x average speedup over the default configuration.
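The structure of a hybrid layout can be illustrated with a small in-memory sketch: rows are split into horizontal partitions, each partition stores its columns separately along with min/max statistics, and those statistics let a selection skip partitions. This is a toy model under simplifying assumptions, not Parquet's on-disk format or ATUN-HL itself; all function names are hypothetical, and `group_size` stands in for the configurable horizontal partition size.

```python
def build_hybrid(rows, group_size):
    """Split rows (a list of dicts) into horizontal partitions stored column-wise."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Inside a partition, each column is stored contiguously.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        # Per-partition statistics: (min, max) of every column.
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def dictionary_encode(values):
    """Replace each value with its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def select_project(groups, filter_col, lo, hi, out_col):
    """Return out_col values where lo <= filter_col <= hi, plus partitions scanned."""
    result, scanned = [], 0
    for group in groups:
        gmin, gmax = group["stats"][filter_col]
        if gmax < lo or gmin > hi:
            continue  # statistics prove no row in this partition matches
        scanned += 1
        cols = group["columns"]
        for f, o in zip(cols[filter_col], cols[out_col]):
            if lo <= f <= hi:
                result.append(o)
    return result, scanned
```

In this sketch a smaller `group_size` gives tighter statistics and more skipping but more partitions to manage, which hints at why the partition size is worth tuning per workload.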