AFrame: Extending DataFrames for Large-Scale Modern Data Analysis

Sinthong, Phanwadee; Carey, Michael J.

doi:10.1109/bigdata47090.2019.9006303

Cited by 10 publications (8 citation statements)

References 12 publications (4 reference statements)

“…The two presented approaches, SDBL via Python and SDB via PostgreSQL also differ with respect to their memory model. When using SDBL via Python, the primary data working object is the pandas dataframe, which albeit its strong resemblance to a relational table in terms of its structure, is nevertheless non-persistent [73]. In other words, it is up to the programmer to choose a suitable data format (such as the feather package which we used) in order to save the dataframe onto disk for later use.…”

Section: Discussion (mentioning)

confidence: 99%

A Simple Semantic-Based Data Storage Layout for Querying Point Clouds

El-Mahgary, Virtanen, Hyyppä. 2020. IJGI.

The importance of being able to separate the semantics from the actual (X,Y,Z) coordinates in a point cloud has been actively brought up in recent research. However, there is still no widely used or accepted data layout paradigm on how to efficiently store and manage such semantic point cloud data. In this paper, we present a simple data layout that makes use of the semantics and that allows for quick queries. The underlying idea is especially suited for a programming approach (e.g., queries programmed via Python) but we also present an even simpler implementation of the underlying technique on a well-known relational database management system (RDBMS), namely, PostgreSQL. The obtained query results suggest that the presented approach can be successfully used to handle point and range queries on large point clouds.

“…In the research community, there are multiple notable papers that have tackled dataframe optimization through vastly different approaches. Sinthong et al propose AFrame, a dataframe system implemented on top of AsterixDB by translating dataframe APIs into SQL++ queries that are supported by AsterixDB [15]. Another work by Yan et al aims to accelerate EDA with dataframes by "auto-suggesting" data exploration operations [17].…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time

Xin, Petersohn, Tang et al. 2021. Preprint.

“…AFrame [24,25] is a library that provides a Pandas DataFrame [19] based syntax to interact with data in Apache AsterixDB. AFrame targets data scientists who are already familiar with Pandas DataFrames.…”

Section: B AFrame (mentioning)

confidence: 99%

“…One is the total runtime, which includes both the DataFrame creation time and the expression runtime, and the other is the expression-only runtime. This is done to reflect the impact of the schema inferencing process which can be time-consuming for some DataFrame libraries [24,25]. Also, depending on the nature of a given analysis, the DataFrame creation time can dominate the actual expression evaluation time.…”

Section: A Dataframe Benchmark (mentioning)

confidence: 99%

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

Sinthong, Carey. 2020. Preprint. Self-citation.

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly including the application of custom machine learning models, to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a Python data analytics library, is implemented as a layer on top of Apache AsterixDB, addressing these issues by incorporating the data scientists' development environment and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a very large volume of data, the initial version only generated SQL++ queries and only operated against Apache AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems as well, making it more flexible for deployment against other data management systems with composable query languages.
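The "incremental query formation" mentioned in this abstract can be illustrated with a small sketch. It assumes a dataframe object that holds no data and instead records a query string, wrapping the previous query as a subquery each time an operation is applied. The class `LazyFrame`, its methods, the dataset name `Tweets`, and the exact query text are hypothetical and chosen only for illustration; they are not AFrame's or PolyFrame's actual API, nor the SQL++ those libraries generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy dataframe that records a query string instead of holding data.

    Hypothetical illustration only: the class, its methods, and the query
    text it emits are assumptions, not AFrame's API or its generated SQL++.
    """

    query: str

    @classmethod
    def from_dataset(cls, name: str) -> "LazyFrame":
        # Base query: select every record of the named dataset.
        return cls(f"SELECT VALUE t FROM {name} t")

    def filter(self, condition: str) -> "LazyFrame":
        # Wrap the previous query as a subquery and add a predicate.
        return LazyFrame(
            f"SELECT VALUE t FROM ({self.query}) t WHERE {condition}"
        )

    def head(self, n: int) -> "LazyFrame":
        # Wrap the previous query again and limit the number of records.
        return LazyFrame(f"SELECT VALUE t FROM ({self.query}) t LIMIT {n}")


# Each call returns a new frame; nothing is executed until a backend
# is asked to run the final query string.
df = LazyFrame.from_dataset("Tweets")
result = df.filter("t.lang = 'en'").head(5)
print(result.query)
```

Because each operation only rewrites a string, retargeting to another query language would, under this assumption, amount to swapping the templates used in each method.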
