2020
DOI: 10.1007/s10586-020-03144-9

Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by 16 publications (8 citation statements: 0 supporting, 8 mentioning, 0 contrasting)
References 6 publications
“…In the method, two parallelism schemas were utilized: data parallelism and model parallelism. In the same direction, Kim et al. [88] proposed a distributed DL method based on heterogeneous systems. The schema was built by using multiple heterogeneous GPUs that worked together.…”
Section: A Parallel Deep Learning On Regular Domains (mentioning)
confidence: 99%
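The statement above refers to data parallelism across heterogeneous GPUs. As a rough illustration only (not the actual scheme of Kim et al. [88]), the sketch below splits a global minibatch across devices in proportion to assumed relative speeds, which is the basic idea behind heterogeneity-aware data parallelism; the device names and speed values are hypothetical.

```python
# A minimal sketch (my own illustration, not the method of Kim et al. [88]):
# in heterogeneity-aware data parallelism, each GPU receives a share of the
# global minibatch proportional to its relative speed, so faster devices do
# more work per step and the replicas finish at roughly the same time.

def split_batch(global_batch_size, relative_speeds):
    """Return per-device minibatch sizes proportional to device speed."""
    total = sum(relative_speeds.values())
    shares = {d: int(global_batch_size * s / total)
              for d, s in relative_speeds.items()}
    # Hand any remainder from integer truncation to the fastest device.
    fastest = max(relative_speeds, key=relative_speeds.get)
    shares[fastest] += global_batch_size - sum(shares.values())
    return shares

# Hypothetical speeds: one fast GPU and one half as fast in the same node.
print(split_batch(96, {"gpu0": 1.0, "gpu1": 0.5}))   # {'gpu0': 64, 'gpu1': 32}
```

After each such unevenly sized forward/backward pass, the per-replica gradients still have to be averaged before the next update, which is where the communication layer discussed in the next statement comes in.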
“…There exist a few works that specifically evaluate and/or improve the MPI CCPs for DL, for example, taking into account the special characteristics of the messages that are exchanged in this type of application [3,4,18,23]. In addition, MPI-based software has been developed for distributed DNN training; for example, MVAPICH2-GDR 1 from Ohio State University or oneAPI 2 from Intel.…”
Section: MPI Collective Communication Primitives (mentioning)
confidence: 99%
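The collective communication primitive that dominates data-parallel DNN training is gradient averaging with MPI_Allreduce. The sketch below, using mpi4py and NumPy, only illustrates that primitive; it does not reflect the tuned, DL-aware implementations in MVAPICH2-GDR, oneAPI, or the cited works [3,4,18,23].

```python
# Minimal sketch of gradient averaging with MPI_Allreduce (mpi4py).
# Run with, e.g.:  mpirun -np 4 python allreduce_grads.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank holds its local gradient; random data stands in for it here.
local_grad = np.random.rand(1_000_000).astype(np.float32)

# Sum the gradients of all ranks in place, then divide by the world size
# so every rank ends up with the same averaged gradient for its replica.
comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
local_grad /= comm.Get_size()
```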
“…Tencent's Mariana [36] used DP, which gained a 2.67× speed increment with four GPUs. In recent years, more parallel deep learning methods have been brought up [11]. In the aspect of algorithms, several algorithms have been brought up to accelerate multi-GPU implementation or make the inference more accurate [1,26] and faster [7,12].…”
Section: Multi-GPU Parallel Computing (mentioning)
confidence: 99%
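For context on the figure quoted above, a 2.67× speed-up on four GPUs corresponds to roughly 67% parallel efficiency; the snippet below just spells out that arithmetic.

```python
# Parallel efficiency for the quoted Mariana result: speedup / number of GPUs.
speedup, n_gpus = 2.67, 4
print(f"parallel efficiency: {speedup / n_gpus:.0%}")   # -> 67%
```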