Design and evaluation of small–large outer joins in cloud computing environments

Cheng, Long; Tachmazidis, Ilias; Kotoulas, Spyros; Antoniou, Grigoris

doi:10.1016/j.jpdc.2017.02.007

Cited by 28 publications

(18 citation statements)

References 36 publications


“…Most iterative applications require the intermediate results as inputs for the next iteration, resulting in wastage of time on the same set of data's I/O and shuffle. Thus, an instruction replacement method enables efficient small–large outer joins in decentralized environments; moreover, this method is easy to implement using existing predicates in data‐processing frameworks.…”

Section: Discussion (mentioning)

confidence: 99%

“…Cheng et al. investigated a global collection of statistics, redundant computation, data backup, and network access overhead. They proposed a partial redistribution and partial query method to improve performance and create a robust join operation with large datasets in a cluster environment. This study considers a similar situation wherein the join operation is replaced for datasets of various sizes to improve performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…The MapReduce architecture, which has advantages such as high fault tolerance, ductility, reliability, and energy efficiency, can operate the cluster with concurrent computing levels of data in the range of terabytes. However, Hadoop MapReduce possesses a limitation in that the intermediate results of MapReduce tasks are generated and stored in offline storage after execution, irrespective of the data volume. Iteration‐based massive data processes are not as effective as expected because the same set of data must be swapped in and out from the offline storage in the MapReduce model.…”

Section: Introduction (mentioning)

confidence: 99%

“…[3,4] However, Hadoop MapReduce possesses a limitation in that the intermediate results of MapReduce tasks are generated and stored in offline storage after execution, irrespective of the data volume [5,6]. Iteration-based massive data processes are not as effective as expected because the same set of data must be swapped in and out from the offline storage in the MapReduce model [7]. High-frequency input/output (I/O) results in long duration data access, and thus, low performance. Furthermore, the transmission time of MapReduce is long; therefore, it can only be applied to batch data processing and not real-time data processing.…”

mentioning

confidence: 99%


Performance enhancement for iterative data computing with in‐memory concurrent processing

Wen, Chen, Chiu, et al. 2019

Concurrency and Computation


Summary: The big data era has resulted in the development of several data analysis tools. Spark is an in‐memory processing tool suited to iterative and interactive data mining. It offers higher data‐processing performance than MapReduce, which relies on offline storage. However, some disadvantages of in‐memory processing, such as massive in‐memory data requirements, cause cross‐node data transfers that result in long computation times. The performance of the process can be improved if the in‐memory process is executed with fewer shuffle instructions. Therefore, this study aims to enhance the performance of iterative applications through instruction replacement. Three empirical research cases with diverse datasets and iterations are used to modify the program. We adopt a strategy of downloading a small resilient distributed dataset and replacing the shuffle‐included instructions to shorten the processing time, with automated code replacement based on exhaustive code matching. The experimental results reveal an improvement of up to 39% in execution time compared with the existing in‐memory processing programs across various dataset sizes.
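The strategy described in this summary, collecting a small dataset once and using it in place of a shuffle-based join, can be illustrated with a minimal sketch. This is not the authors' code and does not use Spark: the function name `broadcast_left_outer_join`, the list-of-lists representation of partitions, and the choice to preserve the large side of the outer join are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_left_outer_join(large_partitions, small_rows):
    """Left outer join of a partitioned large relation with a small one.

    Hypothetical helper: the small relation is turned into one in-memory
    lookup table that every partition reads, so rows of the large
    relation never have to be moved (no shuffle).
    """
    # Build the lookup table once from the small relation.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:  # each partition is handled locally
        for key, value in partition:
            matches = lookup.get(key)  # .get() does not create new entries
            if matches:
                joined.extend((key, value, m) for m in matches)
            else:
                joined.append((key, value, None))  # keep the unmatched row
    return joined


# Two partitions of the large relation and one small relation.
large = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
small = [(1, "x"), (3, "y"), (3, "z")]
result = broadcast_left_outer_join(large, small)
```

In this sketch the row with key 2 has no partner in the small relation and is kept with `None` as its right-hand value, which is what distinguishes the outer join from an inner join. The approach only pays off while the small relation fits comfortably in the memory of each worker.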


“…In fact, these operations have been extensively studied in the field of data management, and a large number of methods have been proposed to improve their performance. For example, for distributed join executions, current research focuses on the challenge of how to efficiently move data, either in the presence of different join workloads (e.g., skew) or different computing platforms (e.g., clusters and Cloud) or both [4], [5], [6], [11], [16], [20]. Their main target is either to reduce network traffic or to improve load-balancing or both, so as to balance computations and improve network communication time.…”

Section: Related Work (mentioning)

confidence: 99%

A Coflow-Based Co-Optimization Framework for High-Performance Data Analytics

Cheng, Wang, Pei, et al. 2017

2017 46th International Conference on Parallel Processing (ICPP)

Self Cite


Abstract: Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.


Deterministic and non‐deterministic query optimization techniques in the cloud computing

Azhir, Navimipour, Hosseinzadeh, et al. 2019

Concurrency and Computation


Query optimization is considered one of the main challenges of the query processing phases in cloud environments. The query optimizer attempts to provide the optimal execution plan by considering the possible query plans. The execution cost of a query can be affected by several factors, including communication costs, unavailability of resources, and access to large distributed data sets. In addition, query optimization is known to be an NP-hard problem, and many researchers have focused on it in recent years. Some techniques have been proposed for solving this problem. Deterministic and non-deterministic methods are the two main categories used to study these techniques. The deterministic and non-deterministic query optimization methods can be further divided into three subcategories: cost-based query plan enumeration, multiple query optimization, and adaptive query optimization methods. Moreover, this paper presents the advantages and disadvantages of the algorithms for solving the query optimization problems in cloud environments. These techniques are also compared in terms of optimization, time, cost, efficiency, and scalability. Finally, some key areas are offered to improve the cloud query optimization mechanisms in the future.

KEYWORDS: cloud computing, database, query optimization, review

INTRODUCTION: The data transfer operation and resource sharing are facilitated by rapid progress of the distributed IT-based systems [1,2]. Cloud computing supports several computers through a network [3]. The cloud computing has a large-scale distributed architecture and virtualized services to deliver the requests to users [4,5]. Moreover, the cloud computing provides important financial advantages and long level cooperation possibilities for organizations and institutions [6]. The cloud computing is defined as a distributed IT-based technology based on service business model [7]. This paradigm provides many benefits for users, such as the provision of computing capabilities, heterogeneous network access, scalability, and elasticity with measured services [8,9]. The cloud computing gives shared access to a large pool of resources, including data storage, memory, processing, and virtual machines [10]. A cloud client, such as a web browser and mobile app, can be helpful in accessing these services [11]. Enormous amounts of data are retrieved from geo-distributed data sources and cross-layer data-handling requirements to make a change in business model [12]. The cloud storage as one of the main services is provided by cloud computing [13], which allows the users to store their data in virtual pools instead of their servers [14]. In addition, subscribers can access the data from any area of cloud [15]. Therefore, the reliability and availability are necessary to recover the information and query processing. The query processing involves three main steps, as shown in Figure 1. First, the query is translated into an expression of the relational algebra. Second, an optimal evaluation plan for the query plan is generated. The query optimization is the main part o...

