Cross-Platform Resource Scheduling for Spark and MapReduce on YARN

Cheng, Dazhao; Zhou, Xiaobo; Lama, Palden; Wu, Jun; Jiang, Changjun

doi:10.1109/tc.2017.2669964

Cited by 39 publications

(27 citation statements)

References 12 publications

“…Figure illustrates the HDFS architecture, in which the coordinator NameNode schedules the processes with the metadata and the processes are assigned to a cluster of DataNodes. All the data are split into several blocks and stored in different DataNodes, and each block in other nodes has several replications. When a program requires access to a file, NameNode coordinates the relevant DataNode to respond and NameNode moves the files stored in the HDFS and simultaneously copies them to the other DataNodes.…”

Section: Related Work (mentioning)

confidence: 99%
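
The block splitting and replication described in the statement above can be illustrated with a minimal sketch. It is a toy model and not HDFS code: the function names `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware and handled by the NameNode).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical 10-byte "file", 4-byte blocks, 4 DataNodes, 3 replicas per block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
```

In this model the `placement` dictionary plays the role of the NameNode's metadata: a reader asks it which DataNodes hold a block and then fetches the bytes from one of them.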

“…The Hadoop dataset process has two sets of operations: map and reduce; by contrast, the Spark dataset process has several sets of operations, and the transformation and action instructions are summarized in Table . The transformation type operating instructions include map(), filter(), flatMap(), groupByKey(), reduceByKey(), join(), and various types of actions (the operating instructions include count(), collect(), reduce(), and save()).…”

Section: Proposed Approach (mentioning)

confidence: 99%
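
The split between transformations and actions in the statement above can be sketched in plain Python. `ToyDataset` is a hypothetical stand-in written for this illustration; it is not Spark's API and only mimics the idea that transformations such as `map()` and `filter()` are recorded lazily, while actions such as `collect()`, `count()`, and `reduce()` trigger the computation.

```python
from functools import reduce as _reduce


class ToyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # an iterable that can be traversed repeatedly
        self._ops = tuple(ops)   # pending transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return ToyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Chain the recorded operations over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded pipeline and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return _reduce(fn, self._run())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


ds = ToyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)   # nothing has been computed yet
collected = ds.collect()            # the action triggers map and filter
total = ds.reduce(lambda a, b: a + b)
```

Each action re-runs the whole recorded pipeline here, since this sketch has no caching; that is a simplification of the toy, not a statement about Spark.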

“…Spark uses a flexible RDD that has four attributes: (1) RDD partition; (2) a set of dependencies between parent RDDs may be either wide or narrow; (3) a function that performs calculations on the parent RDD; and (4) metadata-the description of the district of the data and mode of data storage. 7,8,16 RDD involves parallel computing and fault-tolerant data collection; however, the data in an RDD is read-only. An RDD is not an efficient interactive query, but the performance of iterative applications is higher with Spark than with Hadoop.…”

Section: 2 (mentioning)

confidence: 99%
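
The distinction between narrow and wide dependencies mentioned in the statement above can be shown with lists standing in for partitions. This is a plain-Python sketch under assumptions: `map_partitions`, `group_by_key`, and the modulo partitioner are illustrative names and choices, not Spark functions.

```python
from collections import defaultdict


def map_partitions(partitions, fn):
    """Narrow dependency: output partition i is built only from input partition i."""
    return [[fn(record) for record in part] for part in partitions]


def group_by_key(partitions, num_out):
    """Wide dependency: every output partition may read from every input partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Assumed partitioner: integer key modulo the number of output partitions.
            out[key % num_out][key].append(value)
    return [dict(d) for d in out]


# Two hypothetical input partitions of (key, value) records.
parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
mapped = map_partitions(parts, lambda kv: (kv[0], kv[1].upper()))
grouped = group_by_key(mapped, num_out=2)
```

Key 1 appears in both input partitions, so grouping it has to pull records across partitions. That cross-partition movement is the shuffle that makes wide dependencies costlier than narrow ones.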

“…High‐frequency input/output (I/O) results in long duration data access, and thus, low performance. Furthermore, the transmission time of MapReduce is long; therefore, it can only be applied to batch data processing and not real‐time data processing …”

Section: Introduction (mentioning)

confidence: 99%

“…4,20,23 All the data are split into several blocks and stored in different DataNodes, and each block in other nodes has several replications. 8,22 When a program requires access to a file, NameNode coordinates the relevant DataNode to respond and NameNode moves the files stored in the HDFS and simultaneously copies them to the other DataNodes.…”

mentioning

confidence: 99%


Performance enhancement for iterative data computing with in‐memory concurrent processing

Wen, Chen, Chiu, et al. 2019

Concurrency and Computation


Summary: The big data era has resulted in the development of several data analysis tools. Spark is a type of in‐memory processing fitted iteration and interactive data mining tool. This tool possesses higher data‐processing performance than MapReduce, which is an offline storage mechanism. However, some disadvantages of in‐memory processing, such as massive in‐memory data requirements, cause cross‐node data transfer that result in a long computation time. The performance of the process can be improved if the in‐memory process is executed with fewer shuffle instructions. Therefore, this study aims to enhance the performance of iterative application through instruction replacement. Three empirical research cases with diverse datasets and iterations are used to modify the program. We adopt a strategy of downloading a small resilient distributed dataset and replacing the shuffle‐included instructions to shorten the processing time with an automated code replacement by using exhaustively code matching. The experimental results reveal an improvement of up to 39% in the execution time compared with the existing in‐memory processing programs with various dataset sizes.

