2023
DOI: 10.32604/csse.2023.034710
A Novel Mixed Precision Distributed TPU GAN for Accelerated Learning Curve

Abstract: Deep neural networks are gaining importance and popularity in applications and services. Their training is computationally costly because of the enormous number of learnable parameters and the size of the datasets involved, so parallel and distributed computation strategies are used to accelerate the training process. Generative Adversarial Networks (GAN) are a recent technological achievement in deep learning. These generative models are especially expensive because a GAN consists of two neural networks and tra…
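As a rough illustration of the setup the abstract describes, the sketch below assumes TensorFlow/Keras on a TPU: compute runs in bfloat16 under a mixed-precision policy while variables stay in float32, and the training step is written to be launched across TPU replicas. It is not the authors' implementation; the tiny generator and discriminator are hypothetical stand-ins for whatever architecture the paper actually uses.

```python
# A minimal sketch of mixed-precision GAN training on a TPU, assuming TensorFlow/Keras.
# NOT the paper's code: the generator/discriminator below are hypothetical stand-ins.
import tensorflow as tf

# Connect to the TPU cluster and build a distribution strategy (assumes a TPU runtime).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Mixed precision on TPU: compute in bfloat16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

with strategy.scope():
    generator = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(784, activation="tanh", dtype="float32"),  # outputs in float32
    ])
    discriminator = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(1, dtype="float32"),  # keep logits float32 for a stable loss
    ])
    g_opt = tf.keras.optimizers.Adam(1e-4)
    d_opt = tf.keras.optimizers.Adam(1e-4)
    bce = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

@tf.function
def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], 128])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        d_loss = tf.reduce_mean(bce(tf.ones_like(real_logits), real_logits) +
                                bce(tf.zeros_like(fake_logits), fake_logits))
        g_loss = tf.reduce_mean(bce(tf.ones_like(fake_logits), fake_logits))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return g_loss, d_loss

# In a real run, each step is launched across all TPU replicas with
# strategy.run(train_step, args=(batch,)) over a distributed dataset.
```

Keeping the final layers and the loss in float32 is the usual guard against bfloat16 rounding error dominating the logits while the bulk of the matrix math still runs in reduced precision.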

Cited by 5 publications (2 citation statements)
References 17 publications (18 reference statements)
“…The impact of multi-node Spark and GPU configurations was examined for medical use cases [35], [38]. A multi-node TPU GAN for double precision was developed, but it lacked deployment and did not address the current bottleneck issues in TPU [34]. The proposed model is superior to the existing works since it uses a multi-GPU GAN implementation that addresses the bottleneck issues, ensures deployment and continuous retraining of the model, and makes it usable for real-time applications.…”
Section: Key Takeaways (mentioning)
Confidence: 99%
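To make the comparison in that statement concrete, here is a minimal sketch (assuming TensorFlow; not the cited work's code) of the multi-GPU training plus deploy-and-retrain workflow it alludes to: tf.distribute.MirroredStrategy replicates training across local GPUs, the model is exported for serving, and later reloaded to continue training on fresh data. The tiny Keras model and the "exported_generator" path are hypothetical stand-ins rather than a full GAN.

```python
# A minimal sketch of multi-GPU training with export and continuous retraining,
# assuming TensorFlow; the model and path below are hypothetical stand-ins.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one data-parallel replica per visible GPU

with strategy.scope():
    generator = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(784, activation="tanh"),
    ])
    generator.compile(optimizer="adam", loss="mse")

# ... train on the current data, then export for deployment ...
# (SavedModel directory in TF 2.x; newer Keras versions expect a ".keras" file name instead)
generator.save("exported_generator")

# Continuous retraining: reload the deployed model and resume fitting when new data arrives.
with strategy.scope():
    generator = tf.keras.models.load_model("exported_generator")
# generator.fit(new_dataset, epochs=1)
```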
“…Utilizing high-performance hardware is not the only option; one potential alternative is to parallelize and distribute DNN training operations across many nodes instead. Under these circumstances, each node contributes only a small share of the overall computation [3], [4], [5], [6], [7]. Despite this, communication delay is a critical obstacle in distributed training because vast volumes of data must be exchanged frequently across the computing nodes.…”
Section: Introduction (mentioning)
Confidence: 99%
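A back-of-the-envelope sketch (my own illustration, not from the cited text) of why that communication delay matters: under synchronous data parallelism with a ring all-reduce, each node exchanges roughly 2(N-1)/N times the gradient size every step, so model size and gradient precision trade off directly against network time.

```python
# Per-step gradient traffic in synchronous data-parallel training with a ring all-reduce.
# Illustrative arithmetic only; the model size and node count are hypothetical.
def allreduce_bytes_per_node(num_params, bytes_per_value=4, num_nodes=8):
    """Ring all-reduce moves about 2 * (N - 1) / N of the gradient tensor per node per step."""
    gradient_bytes = num_params * bytes_per_value
    return 2 * (num_nodes - 1) / num_nodes * gradient_bytes

# Example: a 25M-parameter model with float32 gradients across 8 nodes.
per_node = allreduce_bytes_per_node(25_000_000, bytes_per_value=4, num_nodes=8)
print(f"~{per_node / 1e6:.0f} MB exchanged per node per step")  # prints ~175 MB

# Halving bytes_per_value (e.g. bfloat16/float16 gradients) halves this traffic, which is
# one reason mixed or reduced precision also helps on the communication side.
```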