Proceedings of the 2013 International Symposium on Memory Management
DOI: 10.1145/2491894.2466485
A bloat-aware design for big data applications

Cited by 39 publications (29 citation statements). References 33 publications.
“…The raw dataset is a text file with each line containing one data point. Hence the first UDF is a map function, which extracts the data points and stores them into a set of DenseVector objects (lines 12-16). An additional LabeledPoint object is created for each data point to package its feature vector and label value together.…”
Section: Motivating Example
confidence: 99%
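The per-record object creation described in this quote can be illustrated with a minimal plain-Java sketch. `DenseVector` and `LabeledPoint` below are simplified stand-ins for the Spark MLlib classes of the same names, and `parsePoint` is a hypothetical helper, not code from the cited paper:

```java
import java.util.Arrays;

// Simplified stand-ins for the MLlib classes named in the quote.
class DenseVector {
    final double[] values;              // one object + one array per data point
    DenseVector(double[] values) { this.values = values; }
}

class LabeledPoint {
    final double label;                 // packages label and feature vector together
    final DenseVector features;
    LabeledPoint(double label, DenseVector features) {
        this.label = label;
        this.features = features;
    }
}

public class ParseSketch {
    // Hypothetical map-style UDF: one text line -> one LabeledPoint.
    // Each call allocates a String[], a double[], a DenseVector, and a
    // LabeledPoint -- the per-record object churn the quote describes.
    static LabeledPoint parsePoint(String line) {
        String[] parts = line.split(",");
        double label = Double.parseDouble(parts[0]);
        double[] features = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            features[i - 1] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(label, new DenseVector(features));
    }

    public static void main(String[] args) {
        LabeledPoint p = parsePoint("1.0,0.5,2.5");
        System.out.println(p.label + " " + Arrays.toString(p.features.values));
    }
}
```

With millions of input lines, this style multiplies the allocation count by a small constant per record, which is exactly the bloat pattern the surveyed paper targets.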
See 1 more Smart Citation
“…The raw dataset is a text file with each line containing one data point. Hence the first UDF is a map function, which extracts the data points and store them into a set of DenseVector objects and (lines [12][13][14][15][16]). An additional LabeledPoint object is created for each data point to package its feature vector and label value together.…”
Section: Motivating Examplementioning
confidence: 99%
“…Based on this order, the size-type of each UDT is determined by its field that has the highest variability (lines 12-20). Furthermore, each field's final size-type is determined by the type with the highest variability in its type-set.…”
Section: Local Classification Analysis
confidence: 99%
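The "highest variability wins" rule in this quote can be sketched as a simple maximum over an ordered set of size-types. The three level names and their ordering below are illustrative assumptions, not the cited analysis's actual lattice:

```java
import java.util.List;

public class SizeTypes {
    // Hypothetical variability ordering, lowest to highest; the enum's
    // declaration order encodes increasing variability.
    enum SizeType { STATIC, CHANGEABLE, DYNAMIC }

    // A UDT's size-type is that of its most variable field; likewise, a
    // field's final size-type is the most variable member of its type-set.
    static SizeType classify(List<SizeType> fieldTypes) {
        SizeType result = SizeType.STATIC;
        for (SizeType t : fieldTypes) {
            if (t.ordinal() > result.ordinal()) {
                result = t;                 // keep the most variable so far
            }
        }
        return result;
    }
}
```

Under this sketch, a UDT with one fixed-size field and one dynamically sized field would be classified at the dynamic level.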
“…This means that data are accessed at the binary level rather than as objects. This approach was used because creating and collecting language objects is often a cause of performance bottlenecks.…”
Section: Preliminaries—Apache AsterixDB
confidence: 99%
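The binary-level access strategy in this quote can be sketched as follows: records live in a flat `ByteBuffer` and fields are read by offset, so no per-record Java objects are materialized. The record layout and helper names here are illustrative, not AsterixDB's actual storage format:

```java
import java.nio.ByteBuffer;

public class BinaryRecords {
    // Hypothetical fixed-size record: an int id followed by a double value.
    static final int RECORD_SIZE = Integer.BYTES + Double.BYTES;

    // Write n records at fixed offsets in one buffer instead of
    // allocating one heap object per record.
    static ByteBuffer build(int n) {
        ByteBuffer buf = ByteBuffer.allocate(n * RECORD_SIZE);
        for (int i = 0; i < n; i++) {
            buf.putInt(i * RECORD_SIZE, i);                          // id field
            buf.putDouble(i * RECORD_SIZE + Integer.BYTES, i * 0.5); // value field
        }
        return buf;
    }

    // Field access goes straight to the bytes; nothing is deserialized
    // into an object, so the GC never sees these records.
    static double valueAt(ByteBuffer buf, int record) {
        return buf.getDouble(record * RECORD_SIZE + Integer.BYTES);
    }
}
```

The trade-off is the one the quote implies: field reads cost an offset computation, but the heap holds one buffer instead of millions of small objects for the collector to trace.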
“…Comprehensive studies across many contemporary Big Data systems [18] confirm that these overheads lead to significantly reduced scalability (e.g., applications crash with OutOfMemoryError although the size of the processed dataset is much smaller than the heap size) as well as exceedingly high memory management costs (e.g., GC time accounts for up to 50% of the overall execution time). Despite the many optimizations [6, 7, 16, 19, 21, 23-25, 33, 38, 41, 45, 48, 49, 52, 54-57, 60, 61, 72, 73, 76] from various research communities, the poor performance inherent in managed runtimes remains a serious problem that can devalue these domain-specific optimization techniques.…”
Section: Motivation
confidence: 99%