Scalable Data Placement of Data-intensive Services in Geo-distributed Clouds

Atrey, Ankita; Seghbroeck, Gregory Van; Volckaert, Bruno; Turck, Filip De

doi:10.5220/0006767504970508

Cited by 5 publications

(15 citation statements)

References 25 publications


“…More specifically, any change (small or large) in the system workload would require re-execution of the full pipeline to obtain the placement output. This design decision is in line with almost every existent technique [3]-[5], [16], [17], [43], [49] in the extensive literature on data placement. Thus, making the CDR placement algorithm dynamically adapt to the changes in the system workload is not in the scope of the current work.…”

Section: Combined Data and Replica Placement (supporting)

confidence: 67%

“…On the other hand, publicly available specialized heuristics for hypergraph partitioning [7] enable graceful scaling of the aforementioned methods to large datasets. Moving further, Atrey et al. [3], [5] proposed an algorithm based on spectral clustering of hypergraphs, which portrayed quality similar to the algorithms proposed in [43], however, achieved superior efficiency and scalability owing to the use of randomized eigendecomposition techniques for factorizing the hypergraph Laplacian.…”

Section: Related Work (mentioning)

confidence: 98%

“…where, the vector W controls the relative importance of the aforementioned hyperedge weight assignment strategies to obtain the final diagonal hyperedge weight matrix W_Π of size m × m. To summarize, the hypergraph modeling step produces a hypergraph incidence matrix Π and a hyperedge weight matrix W_Π, where the former (Π) models the higher-order interaction between data-items and nodes, while the latter (W_Π) controls the relative importance of different hyperedges.…”

Section: B. CDR-Multi: Hypergraph Modeling (mentioning)

confidence: 99%
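
As a rough illustration of the two matrices named in the statement above, the following Python sketch builds a 0/1 incidence matrix and a diagonal hyperedge weight matrix. The toy hyperedges, the two weighting strategies, the function names, and the rule of blending strategies as a weighted sum are all assumptions made for illustration; the quoted text does not say how the strategies are combined.

```python
import numpy as np

def incidence_matrix(hyperedges, n_items):
    """Return the n_items x m 0/1 incidence matrix for a list of hyperedges."""
    pi = np.zeros((n_items, len(hyperedges)))
    for j, edge in enumerate(hyperedges):
        pi[list(edge), j] = 1.0  # mark every item that belongs to hyperedge j
    return pi

def combined_weight_matrix(strategy_weights, importance):
    """Blend per-strategy hyperedge weights into one diagonal m x m matrix.

    Assumption: the final weight of a hyperedge is the importance-weighted
    sum of the weights assigned by each strategy.
    """
    strategy_weights = np.asarray(strategy_weights, dtype=float)  # shape (k, m)
    importance = np.asarray(importance, dtype=float)              # shape (k,)
    return np.diag(importance @ strategy_weights)

# Hypothetical toy input: 5 data-items, 3 hyperedges, 2 weighting strategies.
hyperedges = [{0, 1, 2}, {1, 3}, {2, 3, 4}]
pi = incidence_matrix(hyperedges, n_items=5)
w_pi = combined_weight_matrix([[3, 1, 2], [0.5, 0.2, 0.9]], importance=[0.7, 0.3])
```

Here `pi` has one row per data-item and one column per hyperedge, and `w_pi` is square with one diagonal entry per hyperedge.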

“…• Spectral Clustering (Spectral): obtains a placement using spectral clustering on hypergraphs and achieves superior efficiency by leveraging randomized eigendecomposition methods. Spectral was proposed in [5].…”

Section: Setup (mentioning)

confidence: 99%

“…Recall that as described in Sec. II, the representative state-of-the-art for data placement and replication of data-intensive services into geographically distributed clouds is comprised of the techniques Spectral [5] and Hyper [43].…”

Section: Setup (mentioning)

confidence: 99%


UnifyDR: A Generic Framework for Unifying Data and Replica Placement

et al. 2020


The advent of (big) data management applications operating at Cloud scale has led to extensive research on the data placement problem. The key objective of data placement is to obtain a partitioning (possibly allowing for replicas) of a set of data-items into distributed nodes that minimizes the overall network communication cost. Although replication is intrinsic to data placement, it has seldom been studied in combination with the latter. On the contrary, most of the existing solutions treat them as two independent problems, and employ a two-phase approach: (1) data placement, followed by (2) replica placement. We address this by proposing a new paradigm, CDR, with the objective of combining data and replica placement as a single joint optimization problem. Specifically, we study two variants of the CDR problem: (1) CDR-Single, where the objective is to minimize the communication cost alone, and (2) CDR-Multi, which performs a multi-objective optimization to also minimize traffic and storage costs. To unify data and replica placement, we propose a generic framework called UnifyDR, which leverages overlapping correlation clustering to assign a data-item to multiple nodes, thereby facilitating data and replica placement to be performed jointly. We establish the generic nature of UnifyDR by portraying its ability to address the CDR problem in two real-world use-cases, that of join-intensive online analytical processing (OLAP) queries and a location-based online social network (OSN) service. The effectiveness and scalability of UnifyDR are showcased by experiments performed on data generated using the TPC-DS benchmark and a trace of the Gowalla OSN for the OLAP queries and OSN service use-case, respectively. Empirically, the presented approach obtains an improvement of approximately 35% in terms of the evaluated metrics and a speed-up of 8 times in comparison to state-of-the-art techniques.


SpeCH: A scalable framework for data placement of data-intensive services in geo-distributed clouds

Atrey, Seghbroeck, Mora, et al. 2019

Journal of Network and Computer Applications


The advent of big data analytics and cloud computing technologies has resulted in widespread research on the data placement problem. Since data-intensive services require access to multiple datasets within each transaction, traditional schemes of uniformly partitioning the data into distributed nodes, as employed by many popular data stores like HDFS or Cassandra, may cause network congestion, thereby affecting system throughput. In this article, we propose a scalable and unified framework for data-intensive service data placement into geographically distributed clouds. The proposed framework introduces a new paradigm for partitioning a set of data-items into geo-distributed clouds using Spectral Clustering on Hypergraphs, and is therefore called SpeCH. Scaling spectral methods to large workloads is challenging, since computing the spectra of the hypergraph Laplacian is a computationally intensive task. SpeCH provides two solutions to tackle this problem: (1) an algorithm, called SpectralApprox, that leverages randomized techniques for obtaining low-rank approximations of the hypergraph matrix with bounded guarantees, thereby significantly improving the efficiency of spectral clustering while also providing high-quality solutions in practice; (2) an algorithm, called SpectralDist, that exploits the highly parallel nature of the spectral clustering algorithm and uses Apache Spark to speed up the process while retaining the same quality guarantees as the exact algorithm. Additionally, being distributed in nature, SpectralDist enables SpeCH to perform data placement on workloads that require resources beyond the capacity of a single machine. Experiments on a real-world trace-based online social network dataset show that SpeCH is effective, efficient, and scalable. Empirically, SpectralApprox is comparable in efficacy on the evaluated metrics, while being up to 10 times faster in execution time when compared to state-of-the-art techniques. On the other hand, though SpectralApprox is 7-8 times faster when compared to SpectralDist, in terms of efficacy on the evaluated metrics the latter is up to 50% better.
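
To make the idea of a randomized low-rank approximation for hypergraph spectral clustering more concrete, here is a minimal Python sketch. It is not the SpectralApprox algorithm from the abstract: it assumes the commonly used normalized hypergraph operator D_v^{-1/2} Π W D_e^{-1} Π^T D_v^{-1/2} and a generic randomized range finder with power iterations. The toy incidence matrix, the weights, and all function names are hypothetical.

```python
import numpy as np

def normalized_hypergraph_operator(pi, edge_w):
    """Return D_v^{-1/2} Pi W D_e^{-1} Pi^T D_v^{-1/2} (symmetric, n x n).

    Assumes every item belongs to at least one hyperedge.
    """
    dv = pi @ edge_w                     # weighted degree of each item
    de = pi.sum(axis=0)                  # number of items in each hyperedge
    scaled = pi / np.sqrt(dv)[:, None]   # D_v^{-1/2} Pi
    return (scaled * (edge_w / de)) @ scaled.T

def randomized_top_eigs(theta, k, oversample=5, n_iter=2, seed=0):
    """Approximate the k largest eigenpairs of a symmetric matrix theta."""
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    cols = min(n, k + oversample)
    # Random projection captures the dominant part of the range of theta.
    q, _ = np.linalg.qr(theta @ rng.standard_normal((n, cols)))
    for _ in range(n_iter):              # power iterations sharpen the basis
        q, _ = np.linalg.qr(theta @ q)
    small = q.T @ theta @ q              # small (cols x cols) projected problem
    vals, vecs = np.linalg.eigh(small)
    order = np.argsort(vals)[::-1][:k]
    return vals[order], q @ vecs[:, order]

# Hypothetical toy hypergraph: 5 data-items, 3 hyperedges.
pi = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]], dtype=float)
edge_w = np.array([2.0, 1.0, 1.5])
theta = normalized_hypergraph_operator(pi, edge_w)
vals, embedding = randomized_top_eigs(theta, k=2)
```

The rows of `embedding` give a low-dimensional representation of the data-items; a clustering step such as k-means on those rows would then produce the actual partition, which is omitted here for brevity.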


A Survey on Machine Learning for Geo-Distributed Cloud Data Center Management

Hogade, Pasricha 2023

IEEE Trans. Sustain. Comput.
