Data redundancy and duplicate detection in spatial join processing

Dittrich, Jens; Seeger, Bernhard

doi:10.1109/icde.2000.839452

Cited by 68 publications

(65 citation statements)

References 21 publications

“…Doing so has the advantage that only elements in the same partition need to be compared to perform the spatial join. Replicating elements, however, has several disadvantages: 1) replicated elements need more space on disk as well as more disk reads and more comparisons for the join and 2) results may be detected twice and deduplication is required (at runtime [25] or at the end).…”

Section: B Space-oriented Partitioning (mentioning)

confidence: 99%

TRANSFORMERS: Robust spatial joins on non-uniform data distributions

Pavlović

Heinis

Tauheed

et al. 2016

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Abstract: Spatial joins are becoming increasingly ubiquitous in many applications, particularly in the scientific domain. While several approaches have been proposed for joining spatial datasets, each of them has a strength for a particular type of density ratio among the joined datasets. More generally, no single proposed method can efficiently join two spatial datasets in a robust manner with respect to their data distributions. Some approaches do well for datasets with contrasting densities while others do better with similar densities. None of them does well when the datasets have locally divergent data distributions. In this paper we develop TRANSFORMERS, an efficient and robust spatial join approach that is indifferent to such variations of distribution among the joined data. TRANSFORMERS achieves this feat by departing from the state-of-the-art through adapting the join strategy and data layout to local density variations among the joined data. It employs a join method based on data-oriented partitioning when joining areas of substantially different local densities, whereas it uses big partitions (as in space-oriented partitioning) when the densities are similar, while seamlessly switching among these two strategies at runtime. We experimentally demonstrate that TRANSFORMERS outperforms state-of-the-art approaches by a factor of between 2 and 8.
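The replication drawback described in the citation statement for this entry can be illustrated with a minimal sketch. It assumes axis-aligned rectangles given as (x1, y1, x2, y2) tuples and a uniform grid; the function names and the sample data are hypothetical and are not taken from any of the cited papers. Each rectangle is copied into every cell it overlaps, so a pair that overlaps in several cells is reported once per cell and has to be deduplicated at the end.

```python
import math


def cell_index(value, cell):
    """Index of the grid cell that contains a coordinate."""
    return math.floor(value / cell)


def cells_for(rect, cell):
    """Yield every grid cell (ix, iy) that a rectangle overlaps."""
    x1, y1, x2, y2 = rect
    for ix in range(cell_index(x1, cell), cell_index(x2, cell) + 1):
        for iy in range(cell_index(y1, cell), cell_index(y2, cell) + 1):
            yield ix, iy


def intersects(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition(rects, cell):
    """Replicate each rectangle id into every cell it overlaps."""
    grid = {}
    for rid, rect in rects.items():
        for c in cells_for(rect, cell):
            grid.setdefault(c, []).append(rid)
    return grid


def grid_join_with_duplicates(R, S, cell):
    """Join cell by cell; a pair is emitted once per shared cell."""
    gr, gs = partition(R, cell), partition(S, cell)
    out = []
    for c in gr.keys() & gs.keys():
        for r in gr[c]:
            for s in gs[c]:
                if intersects(R[r], S[s]):
                    out.append((r, s))
    return out


# Hypothetical sample data: one rectangle per input, overlapping each other.
R = {"r1": (0.5, 0.5, 2.5, 1.5)}
S = {"s1": (1.5, 0.5, 3.5, 1.5)}

raw = grid_join_with_duplicates(R, S, cell=1.0)
unique = sorted(set(raw))  # deduplication at the end of the join
print(len(raw), unique)
```

In this example both rectangles are replicated into the four cells they share, so the single result pair appears four times before the final `set` pass removes the copies.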

“…This means that queries such as those that seek the length of all objects in a particular spatial region will have to remove duplicate objects before reporting the total length. Nevertheless, methods have been developed that avoid these duplicates by making use of the geometry of the type of the data that is being represented (e.g., (Aref and Samet, 1992;Aref and Samet, 1994;Dittrich and Seeger, 2000)). Note that the result of constraining the positions of the partitions means that there is a limit on the possible sizes of the resulting cells (e.g., a power of 2 in the case of a quadtree variant).…”

Section: Methods Based On Spatial Occupancy (mentioning)

confidence: 99%

Sorting Spatial Data by Spatial Occupancy

Samet

2009

GeoSpatial Visual Analytics

Abstract. The increasing popularity of web-based mapping services such as Microsoft Virtual Earth and Google Maps/Earth has led to a dramatic increase in awareness of the importance of location as a component of data for the purposes of further processing as a means of enhancing the value of the nonspatial data and of visualization. Both of these purposes inevitably involve searching. The efficiency of searching is dependent on the extent to which the underlying data is sorted. The sorting is encapsulated by the data structure known as an index that is used to represent the spatial data thereby making it more accessible. The traditional role of the indexes is to sort the data, which means that they order the data. However, since generally no ordering exists in dimensions greater than 1 without a transformation of the data to one dimension, the role of the sort process is one of differentiating between the data and what is usually done is to sort the spatial objects with respect to the space that they occupy. The resulting ordering should be implicit rather than explicit so that the data need not be resorted (i.e., the index need not be rebuilt) when the queries change. The indexes are said to order the space and the characteristics of such indexes are explored further.

“…First, the R*-tree index of R_P is built. Next, each spatial object s in S_P is traversed; R*-tree is searched for the spatial objects in R_P whose MBR overlaps with the MBR of s. Prior to the spatial predicate verification, we use the reference point method [37] to avoid duplicates. If the reference point of r and s is not in the partition, the current calculation is terminated.…”

Section: In-memory Spatial Join (mentioning)

confidence: 99%

A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability

et al. 2016

Sustainability research faces many challenges as respective environmental, urban and regional contexts are experiencing rapid changes at an unprecedented spatial granularity level, which involves growing massive data and the need for spatial relationship detection at a faster pace. Spatial join is a fundamental method for making data more informative with respect to spatial relations. The dramatic growth of data volumes has led to increased focus on high-performance large-scale spatial join. In this paper, we present Spatial Join with Spark (SJS), a proposed high-performance algorithm, that uses a simple, but efficient, uniform spatial grid to partition datasets and joins the partitions with the built-in join transformation of Spark. SJS utilizes the distributed in-memory iterative computation of Spark, then introduces a calculation-evaluating model and in-memory spatial repartition technology, which optimize the initial partition by evaluating the calculation amount of local join algorithms without any disk access. We compare four in-memory spatial join algorithms in SJS for further performance improvement. Based on extensive experiments with real-world data, we conclude that SJS outperforms the Spark and MapReduce implementations of earlier spatial join approaches. This study demonstrates that it is promising to leverage high-performance computing for large-scale spatial join analysis. The availability of large-sized geo-referenced datasets along with the high-performance computing technology can raise great opportunities for sustainability research on whether and how these new trends in data and technology can be utilized to help detect the associated trends and patterns in the human-environment dynamics.
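The reference point check quoted in the citation statement for this entry can be sketched as follows. This is an illustration under stated assumptions, not the SJS implementation: rectangles are (x1, y1, x2, y2) tuples, partitions are cells of a uniform grid, a nested loop stands in for the R*-tree lookup, and the reference point is taken to be the lower-left corner of the intersection of the two MBRs (one common choice). All function names are hypothetical. A pair is reported only in the single cell that contains its reference point, so no duplicate is ever produced and no final deduplication pass is needed.

```python
import math


def cell_index(value, cell):
    """Index of the grid cell that contains a coordinate."""
    return math.floor(value / cell)


def cells_for(rect, cell):
    """Yield every grid cell (ix, iy) that a rectangle overlaps."""
    x1, y1, x2, y2 = rect
    for ix in range(cell_index(x1, cell), cell_index(x2, cell) + 1):
        for iy in range(cell_index(y1, cell), cell_index(y2, cell) + 1):
            yield ix, iy


def intersects(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def reference_cell(a, b, cell):
    """Cell holding the lower-left corner of the intersection of a and b.

    Assumption: this corner is used as the reference point.
    """
    x = max(a[0], b[0])
    y = max(a[1], b[1])
    return cell_index(x, cell), cell_index(y, cell)


def partition(rects, cell):
    """Replicate each rectangle id into every cell it overlaps."""
    grid = {}
    for rid, rect in rects.items():
        for c in cells_for(rect, cell):
            grid.setdefault(c, []).append(rid)
    return grid


def grid_join_reference_point(R, S, cell):
    """Join cell by cell, reporting a pair only in its reference cell."""
    gr, gs = partition(R, cell), partition(S, cell)
    out = []
    for c in gr.keys() & gs.keys():
        for r in gr[c]:
            for s in gs[c]:
                # Skip the pair unless this cell owns its reference point.
                if intersects(R[r], S[s]) and reference_cell(R[r], S[s], cell) == c:
                    out.append((r, s))
    return sorted(out)


# Hypothetical sample data: the pair overlaps in four cells of a unit grid.
R = {"r1": (0.5, 0.5, 2.5, 1.5)}
S = {"s1": (1.5, 0.5, 3.5, 1.5)}
result = grid_join_reference_point(R, S, cell=1.0)
print(result)
```

Because the reference point lies inside both rectangles, the cell that contains it always holds a copy of each of them, so the pair is found exactly once. The check needs only the two MBRs and the cell boundaries, which is why it can run before the more expensive spatial predicate.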
