“…Section 8 of the paper discusses the related research, emphasizing the differences between those papers and the current study. One of the papers examined in this section analyzes latency in various communication libraries in both inter-node and intra-node environments [25]. It delves into the collective communication functions commonly used in distributed deep learning, providing a detailed investigation of each library's performance.…”
In distributed deep learning, improper use of a collective communication library can degrade training performance by increasing communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance depending on the server environment and communication architecture. In this study, we investigate three key aspects of collective communication library performance for distributed deep learning in an intra-node environment. First, we compare and analyze library performance under common distributed deep learning architectures, namely the parameter server and ring all-reduce methods. Second, we evaluate the libraries in different environments, including various container platforms and bare-metal setups, given the scalability and flexibility advantages of cloud virtualization. Last, to ensure practicality, we assess their performance both from a Linux shell and within the PyTorch framework. In the cross-docker virtualization environment, NCCL shows up to 213% higher latency than in single docker; GLOO exhibits 36% lower latency in single docker than in cross docker; and NCCL's all-reduce executes up to 345% faster than the other libraries (MPI and GLOO). These findings can inform the selection of an appropriate collective communication library when designing effective distributed deep learning environments.
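A back-of-the-envelope cost model illustrates why the choice between the two architectures matters. The sketch below is the standard textbook analysis, not the paper's measurements: with `w` workers and a gradient of `n` elements, the parameter server's network link must carry every worker's push and pull, while in ring all-reduce each link carries a bounded volume that is nearly independent of `w`.

```python
# Sketch of the standard communication-volume model for the two
# architectures compared in the abstract (illustrative, not the
# paper's benchmark code).

def param_server_link_volume(n_params: int, n_workers: int) -> int:
    """Elements crossing the server's link per training step: each
    worker pushes its full gradient and pulls the full updated model."""
    return 2 * n_params * n_workers


def ring_allreduce_link_volume(n_params: int, n_workers: int) -> float:
    """Elements crossing any single ring link per step: a reduce-scatter
    phase and an all-gather phase, each moving n * (w - 1) / w elements."""
    return 2 * n_params * (n_workers - 1) / n_workers


if __name__ == "__main__":
    n = 25_000_000  # roughly a ResNet-50-sized gradient (assumption)
    for w in (2, 4, 8):
        ps = param_server_link_volume(n, w)
        ring = ring_allreduce_link_volume(n, w)
        print(f"w={w}: server link {ps:.2e} elems, ring link {ring:.2e} elems")
```

The server link grows linearly with the number of workers, whereas the ring link volume stays below `2n` — one reason ring all-reduce (and libraries optimized for it, such as NCCL) tends to scale better as workers are added.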
“…The Allreduce operator has a wide range of applications in scientific computing and artificial intelligence: it is one of the basic operators of parallel computing and the most important collective communication operator used in distributed deep learning. Realizing highly efficient, scalable, and reliable Allreduce collective communication is therefore important for improving the performance of computation-intensive applications such as distributed training [2].…”
With the increasing demand for computing power in machine learning tasks, the training of deep neural network models has moved to multi-GPU and even larger-scale distributed training. However, the speedup and scalability of model training are largely limited by the communication efficiency between GPUs. To improve the communication efficiency of a domestic GPU accelerator, this paper studies and analyzes the communication performance of the Allreduce operator widely used in deep learning tasks. Based on a data compression algorithm and multi-stream parallelism, the paper ports and optimizes the Allreduce operator for a domestic heterogeneous platform. Experimental results show that, compared with RCCL, the ported and optimized Allreduce operator achieves performance improvements of 30% to 90% across different data sizes. The paper thus implements the porting and optimization of the Allreduce operator on the domestic accelerated heterogeneous platform and achieves good acceleration, supporting efficient communication for domestic GPU accelerators and the ecosystem of domestic heterogeneous platforms.
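The Allreduce semantics discussed above — every rank ends up holding the elementwise reduction of all ranks' buffers — can be illustrated with a minimal pure-Python simulation of the classic ring algorithm (reduce-scatter followed by all-gather). This is a didactic sketch of the textbook algorithm, not the RCCL implementation or the optimized version from the paper:

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce (reduce-scatter + all-gather) over lists.

    buffers[r] is rank r's local data; all lists must have equal length
    divisible by the number of ranks. On return, every rank holds the
    elementwise sum of all ranks' original data.
    """
    w = len(buffers)
    n = len(buffers[0])
    assert all(len(b) == n for b in buffers) and n % w == 0, \
        "sketch assumes equal buffers with length divisible by world size"
    chunk = n // w

    def span(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at step s, rank r forwards chunk (r - s) % w
    # to its ring neighbour, which accumulates it. After w - 1 steps,
    # rank r holds the fully reduced chunk (r + 1) % w.
    for s in range(w - 1):
        for r in range(w):
            c = (r - s) % w
            nxt = buffers[(r + 1) % w]
            for i in span(c):
                nxt[i] += buffers[r][i]   # neighbour accumulates

    # Phase 2, all-gather: circulate the reduced chunks around the ring
    # so every rank ends with the complete result.
    for s in range(w - 1):
        for r in range(w):
            c = (r + 1 - s) % w
            nxt = buffers[(r + 1) % w]
            for i in span(c):
                nxt[i] = buffers[r][i]    # neighbour copies

    return buffers


# Example: two ranks all-reducing four-element gradients.
ring_allreduce([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
# every rank now holds [11.0, 22.0, 33.0, 44.0]
```

Each rank sends only `n * (w - 1) / w` elements per phase, which is what makes the ring variant bandwidth-optimal and a natural target for the compression and multi-stream optimizations the paper applies.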