Parallelizing the Data Cube

Dehne, Frank; Eavis, Todd; Hambrusch, Susanne E.; Rau-Chaplin, Andrew

doi:10.1007/3-540-44503-x_9

Cited by 18 publications

(15 citation statements)

References 23 publications


“…They are divided into two main groups: work partitioning [11], [12] and data partitioning [13], [14]. In work partitioning, each processor (or node) of the cluster computes aggregates for a set of one or many views independently.…”

Section: A Rollup In Parallel Databases (mentioning)

confidence: 99%

Efficient and Self-Balanced ROLLUP Aggregates for Large-Scale Data Summarization

Phan, Hoang-Xuan, Dell’Amico, et al. 2015

2015 IEEE International Congress on Big Data


Abstract: Data summarization queries that compute aggregates by grouping datasets across several dimensions are essential to help users make sense of very large datasets. In this work, we focus on ROLLUP, an important operator that has been recently added to the Hadoop MapReduce ecosystem. However, its current implementation suffers from very large communication costs, leading to inefficient executions. We thus proceed with the design of a new ROLLUP operator for high-level languages. Our operator is self-optimizing, which means that it automatically performs load-balancing and determines a suitable operating point to achieve the highest performance. We have implemented our ROLLUP operator for Apache Pig, a popular high-level language in the Hadoop ecosystem. Our experimental results, obtained on both synthetic and real datasets, indicate that our new operator outperforms the current ROLLUP implementation in Pig by at least 50%.
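The citation statement from this paper describes work partitioning as each processor computing aggregates for its own set of views independently. A minimal sketch of that idea follows, assuming a SUM aggregate, a round-robin assignment policy, and a toy fact table; the names `aggregate_view`, `assign_views`, and `run_worker` are hypothetical and come from neither paper. The loop over workers is a sequential stand-in for processors running in parallel.

```python
from collections import defaultdict


def aggregate_view(rows, view, measure):
    """Sum `measure` over rows grouped by the dimensions in `view`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in view)] += row[measure]
    return dict(totals)


def assign_views(views, n_workers):
    """Round-robin split of whole views across workers (an assumed policy)."""
    return [views[i::n_workers] for i in range(n_workers)]


def run_worker(rows, my_views, measure):
    """What one processor would do: compute only its own views."""
    return {view: aggregate_view(rows, view, measure) for view in my_views}


# Hypothetical toy fact table with two dimensions and one measure.
rows = [
    {"store": "A", "month": "Feb", "sales": 10},
    {"store": "A", "month": "Mar", "sales": 5},
    {"store": "B", "month": "Feb", "sales": 7},
]
views = [(), ("store",), ("month",), ("store", "month")]

assignment = assign_views(views, 2)
results = {}
for my_views in assignment:  # sequential stand-in for parallel processors
    results.update(run_worker(rows, my_views, "sales"))
```

No worker needs another worker's output here, which is what makes the views independent units of work; a real system would also have to balance view sizes, which this sketch ignores.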


“…Building the data cube can be a massive computational task, and significant research has been published on sequential and parallel data cube construction methods (e.g. [4], [5], [3], [6], [7], [8]). However, the traditional static data cube approach has several disadvantages.…”

Section: A Background (mentioning)

confidence: 99%

A distributed tree data structure for real-time OLAP on cloud architectures

Dehne, Kong, Rau-Chaplin, et al. 2013

2013 IEEE International Conference on Big Data


Abstract: In contrast to queries for on-line transaction processing (OLTP) systems that typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database which often leads to performance issues. In this paper we introduce CR-OLAP, a Cloud based Real-time OLAP system based on a new distributed index structure for OLAP, the distributed PDCR tree, that utilizes a cloud infrastructure consisting of (m + 1) multi-core processors. With increasing database size, CR-OLAP dynamically increases m to maintain performance. Our distributed PDCR tree data structure supports multiple dimension hierarchies and efficient query processing on the elaborate dimension hierarchies which are so central to OLAP systems. It is particularly efficient for complex OLAP queries that need to aggregate large portions of the data warehouse, such as "report the total sales in all stores located in California and New York during the months February-May of all years". We evaluated CR-OLAP on the Amazon EC2 cloud, using the TPC-DS benchmark data set. The tests demonstrate that CR-OLAP scales well with increasing number of processors, even for complex queries. For example, on an Amazon EC2 cloud instance with eight processors, for a TPC-DS OLAP query stream on a data warehouse with 80 million tuples where every OLAP query aggregates more than 50% of the database, CR-OLAP achieved a query latency of 0.3 seconds which can be considered a real time response.


“…that together form a key. Measures are typically numeric elements like packet […] The pre-computation of the different views of a data cube (i.e., the forming of aggregates for every combination of GROUP BY attributes) is critical to improving the response time of the queries [18]. However, in many cases not all views are needed for decision making, therefore it is advantageous to use only selected views.…”

Section: The Data Cube (mentioning)

confidence: 99%

Substantiating Anomalies In Wireless Networks Using Group Outlier Scores

Sithirasenan, Muthukkumarasamy 2011

JSW


Huge amounts of network traces can be collected from today’s busy computer networks. Analyzing these traces could pave the way to detect unusual conditions and/or other anomalies. Presently, due to the lack of effective substantiating mechanisms intrusion detection systems often exhibit numerous false positives or negatives. The efficiency of a network intrusion detection system (NIDS) depends very much on detecting and effectively validating the detected anomalies. Furthermore, most NIDSs do not have proven mechanisms that will easily accommodate legitimate dynamic changes. Achieving dynamic adaptation in real time has been a long standing desire for effective intrusion detection and prevention. Real time detection of outliers is a feasible option to substantiate anomalies in large data sets, leading to effective intrusion detection and prevention. In this context we propose and investigate a novel mechanism to detect intruders and to classify security threats using group outliers. Our system monitors for timing and/or behavioral anomalies and uses outlier based techniques to substantiate the anomaly. In this paper we introduce the concept of Group Outlier Score (GOS) and its use in substantiating security threats in wireless networks. We have tested the concept on our experimental wireless networking environment. The analysis of the results reveals that with a threshold value of 1.2 for GOS our system demonstrates optimum performance.
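The citation statement from this paper describes a data cube's views as the aggregates for every combination of GROUP BY attributes, and notes that often only selected views are needed. The sketch below illustrates both points under stated assumptions: the dimensions `src`, `dst`, `port`, the measure `bytes`, and the helpers `enumerate_views` and `materialize` are hypothetical and are not taken from the cited work.

```python
from itertools import combinations


def enumerate_views(dims):
    """All 2**len(dims) GROUP BY combinations, most detailed first."""
    return [
        combo
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]


def materialize(rows, views, measure):
    """Pre-compute a SUM aggregate for each requested view."""
    cube = {}
    for view in views:
        totals = {}
        for row in rows:
            key = tuple(row[d] for d in view)
            totals[key] = totals.get(key, 0) + row[measure]
        cube[view] = totals
    return cube


# Hypothetical network-trace records.
rows = [
    {"src": "h1", "dst": "h2", "port": 80, "bytes": 100},
    {"src": "h1", "dst": "h3", "port": 80, "bytes": 50},
    {"src": "h2", "dst": "h3", "port": 22, "bytes": 30},
]

all_views = enumerate_views(("src", "dst", "port"))
full_cube = materialize(rows, all_views, "bytes")

# Materializing only the views a decision actually needs avoids the rest.
selected = [("src",), ("port",)]
partial_cube = materialize(rows, selected, "bytes")
```

With three dimensions the full cube has eight views, while the selected subset materializes two; the gap grows exponentially with the number of dimensions, which is the motivation for view selection.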

