2017
DOI: 10.1007/s11227-016-1949-7
Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS

Cited by 14 publications (5 citation statements)
References 19 publications
“…A recent study conducted by [24] showed that the k-means algorithm based on the Hadoop framework with 3 slave nodes achieves up to a 2x speedup over sequential k-means. However, the overhead of reading and writing data to local storage at each iteration harms the overall performance of the Hadoop framework [28].…”
Section: Related Work
confidence: 99%
“…It was found to achieve a significant speedup over the sequential k-means algorithm. However, the overhead of storing and retrieving data to/from HDFS at each iteration degrades the overall performance of the Hadoop framework [28]. Additionally, Hadoop is not well suited to small files [35].…”
Section: Related Work
confidence: 99%
“…Whereas Apache Hadoop follows a disk-based model for reading and writing data, Apache Spark performs in-memory computations on resilient distributed datasets. Apache Hadoop is an open-source, Java-based distributed computing framework built for applications implemented using the MapReduce parallel data processing paradigm [7] and the Hadoop Distributed File System (HDFS) [8].…”
Section: Background and Motivation
confidence: 99%
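Several of the statements above attribute Hadoop's weakness on iterative algorithms such as k-means to the disk/HDFS round trip at every iteration. The following is a minimal, single-machine Python sketch (not Hadoop code; all names are illustrative) of the iterative structure in question: each iteration makes a full pass over the data, which under MapReduce corresponds to one job that rereads its input from storage, whereas Spark would keep the points cached in memory across iterations.

```python
import random

def kmeans(points, k, iterations):
    """Lloyd's k-means on 1-D points; each loop body is one full data pass."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # In MapReduce, this pass over `points` is one job per iteration:
        # map = assign each point to its nearest centroid,
        # reduce = average the assigned points per cluster.
        # The job's input is reread from HDFS/local disk every time,
        # which is the per-iteration overhead cited in [28].
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

random.seed(0)
# Two well-separated 1-D Gaussian clusters around 0 and 10.
data = ([random.gauss(0, 1) for _ in range(100)] +
        [random.gauss(10, 1) for _ in range(100)])
print(sorted(round(c, 2) for c in kmeans(data, 2, 10)))
```

Spark avoids the reread by caching the equivalent of `points` as an RDD in memory, so only the small centroid list changes between iterations.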
“…The rapid development of Big Data frameworks addresses the distribution, communication, and processing of vast amounts of data. For instance, popular frameworks such as Hadoop and Spark [2] can handle massive amounts of data, relying on the MapReduce paradigm [3] to process and generate extensive data sets [4]. The data sets are stored across distributed clusters, and each cluster runs a distributed processing scheme.…”
Section: Introduction
confidence: 99%