Scalability analysis of declustering methods for multidimensional range queries

Moon, Bongki; Saltz, Joel

doi:10.1109/69.683759

Cited by 32 publications

(28 citation statements)

References 40 publications

(52 reference statements)

“…Several research projects have looked at improving I/O performance using different declustering techniques [9,14]. Parallel file systems and I/O libraries have also been a widely studied research topic, and many such systems and libraries have been developed [3,7,13].…”

Section: Related Work (mentioning)

confidence: 99%

“…Chunks in each replica that intersect the query are categorized as partial or full chunks and into different fragments, and the respective goodness values of the fragments are calculated (steps 2-6). For a given query Q, let us denote the set of all fragments as F and the list of all chosen fragments in decreasing order of goodness value as S. We can apply our greedy search over F (the while loop over steps 8-18). We choose the fragment with the largest goodness value, move it from F to S, and modify Q by subtracting the range contained by this fragment.…”

Section: Replica Selection Algorithm (mentioning)

confidence: 99%
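
The greedy loop described in the statement above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' implementation: ranges are one-dimensional, and the `Fragment` class, its `cost` field, and the goodness measure (overlap with the remaining query divided by cost) are hypothetical stand-ins for definitions the excerpt does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    lo: int      # inclusive start of the 1-D range the fragment covers
    hi: int      # exclusive end of that range
    cost: float  # hypothetical retrieval cost (assumption, not from the paper)


def overlap(frag, remaining):
    """Length of the still-uncovered query range that this fragment covers."""
    return sum(max(0, min(frag.hi, b) - max(frag.lo, a)) for a, b in remaining)


def subtract(remaining, frag):
    """Remove the fragment's range from the list of uncovered intervals."""
    out = []
    for a, b in remaining:
        if frag.hi <= a or frag.lo >= b:
            out.append((a, b))           # no intersection, keep as is
            continue
        if a < frag.lo:
            out.append((a, frag.lo))     # piece left of the fragment
        if frag.hi < b:
            out.append((frag.hi, b))     # piece right of the fragment
    return out


def greedy_select(query, fragments):
    """Repeatedly move the fragment with the largest goodness from F to S."""
    remaining = [query]                  # the part of Q not yet covered
    F, S = list(fragments), []
    while remaining and F:
        best = max(F, key=lambda f: overlap(f, remaining) / f.cost)
        if overlap(best, remaining) == 0:
            break                        # nothing left in F helps the query
        F.remove(best)
        S.append(best)
        remaining = subtract(remaining, best)
    return S, remaining


frags = [Fragment("A", 0, 60, 10.0),
         Fragment("B", 40, 100, 10.0),
         Fragment("C", 0, 100, 50.0)]
chosen, uncovered = greedy_select((0, 100), frags)
print([f.name for f in chosen], uncovered)
```

In this sketch the goodness value is recomputed against the shrinking query on every iteration; whether the original algorithm does the same is not stated in the excerpt.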

“…Efficient access to data also depends on how well the data has been distributed across multiple storage nodes. The goal of declustering [9,14] is to distribute the data across as many storage units as possible so that data elements that satisfy a query can be retrieved from many sources in parallel. Caching is yet another optimization that targets multiple query workloads [1,10,19,21].…”

Section: Introduction (mentioning)

confidence: 99%
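
To make the goal of declustering concrete, here is a small sketch of one classic scheme, Disk Modulo, which assigns the grid cell with coordinates (i1, ..., id) to disk (i1 + ... + id) mod M. It is shown only as an example of the idea; the citing papers do not say which scheme they use, and the helper names `disk_modulo` and `disks_hit` are made up for this illustration.

```python
from collections import Counter
from itertools import product


def disk_modulo(coords, num_disks):
    """Disk Modulo: a grid cell goes to disk (sum of its coordinates) mod M."""
    return sum(coords) % num_disks


def disks_hit(lo, hi, num_disks):
    """Count the cells each disk must serve for the range query [lo, hi).

    lo and hi are per-dimension cell indices; hi is exclusive.
    """
    cells = product(*(range(a, b) for a, b in zip(lo, hi)))
    return Counter(disk_modulo(c, num_disks) for c in cells)


# A 1x4 query on 4 disks touches every disk exactly once.
row_load = disks_hit((0, 0), (1, 4), 4)
# A 2x2 query on 4 disks sends two cells to the same disk.
square_load = disks_hit((0, 0), (2, 2), 4)
print(row_load, square_load)
```

The response time of a query is governed by the most heavily loaded disk, so the 2x2 query in this sketch needs two parallel I/O steps even though four disks are available.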

Servicing range queries on multidimensional datasets with partial replicas

Weng, Çatalyürek, Kurç, et al. 2005

CCGrid 2005. IEEE International Symposium on Cluster Computing and the Grid, 2005.

Partial replication is one type of optimization to speed up execution of queries submitted to large datasets. In partial replication, a portion of the dataset is extracted, re-organized, and re-distributed across the storage system. The objective is to reduce the volume of I/O and increase I/O parallelism for different types of queries and for the portions of the dataset that are likely to be accessed frequently. When multiple partial replicas of a dataset exist, query execution plan should be generated so as to use the best combination of subsets of partial replicas (and possibly the original dataset) to minimize query execution time. In this paper, we present a compiler and runtime approach for range queries submitted against distributed scientific datasets. A heuristic algorithm is proposed to choose the set of replicas to reduce query execution. We show the efficiency of the proposed method using datasets and queries in oil reservoir simulation studies on a cluster machine.

“…In addition, distributing the input dataset across multiple storage nodes has the advantage that data retrieval can be parallelized. A number of techniques have been developed for partitioning and declustering multi-dimensional datasets [15,16,31,33]. Obviously, the effectiveness of a particular distribution depends on how well it matches the common data access and query patterns of the target application class.…”

Section: Data Distribution Among Storage Nodes (mentioning)

confidence: 99%

A Parallel Implementation of 4-Dimensional Haralick Texture Analysis for Disk-Resident Image Datasets

Woods, Clymer, Saltz, et al.

Proceedings of the ACM/IEEE SC2004 Conference

“…Since data is accessed through range queries, it is desirable to have data items that are close to each other in the multi-dimensional space placed in the same chunk. Chunks are distributed across the disks attached to ADR back-end nodes using a declustering algorithm [10,16] to achieve I/O parallelism during query processing. Each chunk is assigned to a single disk, and is read and/or written during query processing only by the local processor to which the disk is attached.…”

Section: Storing Datasets In ADR (mentioning)

confidence: 99%

Optimizing retrieval and processing of multi-dimensional scientific datasets

Chang, Kurç, Sussman, et al.

Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000

Exploring and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We have been developing the Active Data Repository (ADR), an infrastructure that integrates storage, retrieval, and processing of large multi-dimensional scientific datasets on distributed memory parallel machines with multiple disks attached to each node. In earlier work, we proposed three strategies for processing range queries within the ADR framework. Our experimental results show that the relative performance of the strategies changes under varying application characteristics and machine configurations. In this work we investigate approaches to guide and automate the selection of the best strategy for a given application and machine configuration. We describe analytical models to predict the relative performance of the strategies when input data elements are uniformly distributed in the attribute space of the output dataset, restricting the output dataset to be a regular d-dimensional array. We present an experimental evaluation of these models for various synthetic datasets and for several driving applications on a 128-node IBM SP.
