Adaptive Caching in Big SQL using the HDFS Cache

Floratou, Avrilia; Megiddo, Nimrod; Potti, Navneet; Özcan, Fatma; Kale, Uday; Schmitz-Hermes, Jan

doi:10.1145/2987550.2987553

Cited by 34 publications

(27 citation statements)

References 15 publications

“…However, PACMan does not allow applications to specify hot data in memory for subsequent efficient accesses and does not implement cache admission policies. Big SQL [16] is an SQL-on-hadoop system that utilizes HDFS cache for caching table partitions. Big SQL presents two algorithms, namely SLRU-K and EXD, that explore the tradeoff of caching objects based on recency and frequency of data accesses.…”

Section: Distributed File Systems and Tiering (mentioning)

confidence: 99%

“…All these decisions will be handled through pluggable downgrade and upgrade policies, elaborated in Sections 4-6. The decisions are made at the granularity of files (rather than blocks) since previous work [5,16] has shown that performance improvement is attained only when entire files are present in a higher tier (called the "all-or-nothing" property in [5]).…”

Section: Adaptive Tiered Storage Management (mentioning)

confidence: 99%

“…Otherwise, LFU-F selects the LFU file from Pnew. EXD (Exponential Decay) explores the tradeoff between recency and frequency in data accesses in Big SQL [16]. In particular, it selects the file with the lowest weight W computed using the following formula: The parameter α determines the weight of frequency vs. recency and it is set to 1.16 × 10^-8 based on [16].…”

Section: Which File To Downgrade (mentioning)

confidence: 99%

“…EXD (Exponential Decay) explores the tradeoff between recency and frequency in data accesses in Big SQL [16]. In particular, it selects the file with the lowest weight W computed using the following formula: The parameter α determines the weight of frequency vs. recency and it is set to 1.16 × 10^-8 based on [16]. XGB (XGBoost-based Modeling) incrementally trains and utilizes an XGBoost model (recall Section 4) for deciding which file will not be accessed in the distant future.…”

Section: Which File To Downgrade (mentioning)

confidence: 99%
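
The two excerpts above describe EXD only in words; the weight formula itself is missing from the quoted text. As a rough illustration of the idea, the sketch below assumes a common exponential-decay form in which each access adds 1 to a file's weight and the accumulated weight decays by e^(-α·Δt) between accesses. That form, the names `DecayedWeight` and `pick_downgrade_candidate`, and the time unit are assumptions made for illustration and are not taken from the cited papers; only the value of α comes from the excerpt.

```python
import math

# Value of alpha quoted in the excerpt; the time unit it applies to is an assumption.
ALPHA = 1.16e-8


class DecayedWeight:
    """Hypothetical per-file weight that mixes frequency and recency."""

    def __init__(self, alpha=ALPHA):
        self.alpha = alpha
        self.weight = 0.0
        self.last_access = None  # timestamp of the most recent access

    def current(self, now):
        """Weight decayed from the last access up to `now`."""
        if self.last_access is None:
            return 0.0
        return self.weight * math.exp(-self.alpha * (now - self.last_access))

    def record_access(self, now):
        """Decay the old weight to `now`, then add 1 for the new access."""
        self.weight = self.current(now) + 1.0
        self.last_access = now


def pick_downgrade_candidate(weights, now):
    """Return the name of the file with the lowest decayed weight at `now`."""
    return min(weights, key=lambda name: weights[name].current(now))


if __name__ == "__main__":
    weights = {"a": DecayedWeight(), "b": DecayedWeight()}
    weights["a"].record_access(0)
    weights["a"].record_access(1000)
    weights["b"].record_access(500)
    print(pick_downgrade_candidate(weights, now=2000))
```

Under this assumed form, a small α makes old accesses decay slowly, so the weight behaves much like an access count; a large α makes it depend mostly on the most recent access.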

Automating distributed tiered storage management in cluster computing

Herodotou

Kakoulli

2019

Proc. VLDB Endow.

Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them while trying to optimize their workloads. In this paper, we develop a general framework for automatically moving data across the available storage tiers in distributed file systems. Moreover, we employ machine learning for tracking and predicting file access patterns, which we use to decide when and which data to move up or down the storage tiers for increasing system performance. Our approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.

“…they perform within a constant factor k of the optimal offline algorithm. More recently, cost-based algorithms have been used for adaptive caching in Hadoop-based analytics systems [19]. Unlike algorithms for data items with identical costs, there is no provably optimal, polynomial time cache eviction algorithm [12].…”

Section: Related Work (mentioning)

confidence: 99%

ReCache

2017

As data continues to be generated at exponentially growing rates in heterogeneous formats, fast analytics to extract meaningful information is becoming increasingly important. Systems widely use in-memory caching as one of their primary techniques to speed up data analytics. However, caches in data analytics systems cannot rely on simple caching policies and a fixed data layout to achieve good performance. Different datasets and workloads require different layouts and policies to achieve optimal performance. This paper presents ReCache, a cache-based performance accelerator that is reactive to the cost and heterogeneity of diverse raw data formats. Using timing measurements of caching operations and selection operators in a query plan, ReCache accounts for the widely varying costs of reading, parsing, and caching data in nested and tabular formats. Combining these measurements with information about frequently accessed data fields in the workload, ReCache automatically decides whether a nested or relational column-oriented layout would lead to better query performance. Furthermore, ReCache keeps track of commonly utilized operators to make informed cache admission and eviction decisions. Experiments on synthetic and real-world datasets show that our caching techniques decrease caching overhead for individual queries by an average of 59%. Furthermore, over the entire workload, ReCache reduces execution time by 19-75% compared to existing techniques. PVLDB Reference Format: Tahir Azim, Manos Karpathiotakis and Anastasia Ailamaki. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 11(3): xxxx-yyyy, 2017.

Adaptive Hierarchical Cache Management for Cloud RAN and Multi-access Edge Computing in 5G Networks

Rajendiran

Moh

2019

Advances in Intelligent Systems and Computing
