2022
DOI: 10.11591/eei.v11i3.3187

Distributed big data analysis using spark parallel data processing

Abstract: Nowadays, the big data marketplace is rising rapidly. The big challenge is finding a system that can store and handle a huge volume of data and then process it to mine hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It contains a fast big data processing engine using Apache Spark and a big data storage environment using Apache Hadoop. The system tests about 11 gigabytes of text data which are collected from multiple sour…



Cited by 7 publications (4 citation statements)
References 19 publications
“…Every node receives a tiny batch of data, which is then used to compute its gradient and send it back to the central node. Distributed training uses synchronous and asynchronous methods [29], [30].…”
Section: B. Data Parallel Model
confidence: 99%
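The synchronous variant described in the excerpt can be sketched in a few lines. This is a hypothetical single-process illustration, not code from the cited papers: each "worker" computes a gradient on its own mini-batch, and a central step averages the gradients before updating the shared parameter. The linear model, learning rate, and data shards are all invented for illustration.

```python
# Hypothetical sketch of synchronous data-parallel training: each worker
# computes a gradient on its own mini-batch, and a central node averages
# the gradients before updating the shared parameter.

def gradient(w, batch):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def synchronous_step(w, batches, lr=0.01):
    """One synchronous update: wait for every worker's gradient, then average."""
    grads = [gradient(w, b) for b in batches]  # one gradient per worker
    return w - lr * sum(grads) / len(grads)    # central node averages

# Example: two workers, each holding a shard of data generated by y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards)
# w converges toward the true slope 3.0
```

In the asynchronous variant, by contrast, the central node would apply each worker's gradient as it arrives instead of waiting for all of them.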
“…MapReduce is a programming methodology created by Google to handle large-scale data analysis. It is based on the Hadoop framework [11], [58], [78]–[81], [64]–[67], [69]–[71], [74]. It is employed in a wide variety of applications.…”
Section: MapReduce
confidence: 99%
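The MapReduce pattern the excerpt refers to can be shown with a minimal single-process word count. This is only an illustrative sketch of the three phases (map, shuffle, reduce); a real Hadoop job would distribute these phases across a cluster, and the function names here are invented.

```python
from collections import defaultdict
from itertools import chain

# Minimal single-process sketch of the MapReduce pattern:
# map emits (key, value) pairs, shuffle groups pairs by key,
# reduce aggregates each group.

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data spark", "spark hadoop big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
# counts == {"big": 2, "data": 2, "spark": 2, "hadoop": 1}
```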
“…Also, the characteristics of Spark are appropriate from the bottom up for handling big data, and it is much faster than other big data tools such as Hadoop. Besides, it supports many programming languages such as Java, Scala, Python, and R [19]. Fortunately, the Spark machine learning library consists of an implementation of the ALS algorithm for building a model in the form of collaborative filtering [20].…”
Section: Alternating Least Squares With Spark
confidence: 99%
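The alternating least squares (ALS) idea behind Spark's collaborative-filtering library can be illustrated with a tiny pure-Python, rank-1 sketch: approximate a ratings matrix R by an outer product u·vᵀ, solving for one factor vector while holding the other fixed. This is only a didactic assumption-laden toy (Spark's MLlib runs a distributed, regularized, higher-rank version); the matrix and function below are invented for illustration.

```python
# Hypothetical rank-1 ALS sketch for collaborative filtering:
# approximate R by the outer product u * v^T, alternately solving
# the least-squares problem for u with v fixed, then for v with u fixed.

def als_rank1(R, iterations=20):
    n_users, n_items = len(R), len(R[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix v, solve least squares for each user factor u[i].
        for i in range(n_users):
            u[i] = sum(R[i][j] * v[j] for j in range(n_items)) / sum(x * x for x in v)
        # Fix u, solve least squares for each item factor v[j].
        for j in range(n_items):
            v[j] = sum(R[i][j] * u[i] for i in range(n_users)) / sum(x * x for x in u)
    return u, v

# A ratings matrix that is exactly rank 1, so the toy model can recover it.
R = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.0],
     [3.0, 6.0, 9.0]]
u, v = als_rank1(R)
error = sum((R[i][j] - u[i] * v[j]) ** 2 for i in range(3) for j in range(3))
# error is essentially zero: the factorization reproduces R
```

Each subproblem is an ordinary least-squares fit, which is why alternating between them parallelizes so naturally across a cluster.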