RELATED WORK

Communication optimization for distributed training. Generally speaking, a wide range of approaches can be taken to reduce the communication overhead of distributed training. These include, but are not limited to: 1) using large mini-batches [16] and periodic communication [42] to reduce the number of communication rounds; 2) using gradient compression techniques, e.g., gradient sparsification [34] and quantization [3], to reduce the traffic volume in each iteration; 3) relaxing the synchronization requirement [22], [23], [46]; 4) taking the intra-machine GPU topology into consideration [27]; 5) designing parameter exchange schemes based on the network topology [44]; 6) overlapping communication with computation [25], [38], [49]; 7) leveraging advanced communication libraries, e.g., ZMQ [21] and NCCL [26]; 8) exploiting fast network protocols, e.g., RDMA [17], [48]; 9) performing in-network aggregation to reduce the in-network traffic volume [6], [30], [39]; and 10) minimizing network flow completion time via congestion control [7], flow scheduling [4], [33], or coflow scheduling [12], [43], [50], [51]. We note that, while some of these methods have already been integrated into distributed DNN training systems, others remain to be explored.
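To make one of these approaches concrete, the following is a minimal sketch of gradient compression (approach 2), combining top-k sparsification with uniform 8-bit quantization of the surviving values. It assumes PyTorch; the function names, the k_ratio parameter, and the payload layout are illustrative and are not taken from the cited works.

import torch

def compress_gradient(grad: torch.Tensor, k_ratio: float = 0.01):
    """Keep the k largest-magnitude entries, then quantize them to 8 bits."""
    flat = grad.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    # Sparsification: indices of the top-k entries by magnitude.
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    # Uniform quantization of the surviving values into int8 with one scale.
    scale = values.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp((values / scale).round(), -127, 127).to(torch.int8)
    # (idx, q, scale) is the compressed payload that would be sent over the
    # network instead of the full dense gradient.
    return idx, q, scale

def decompress_gradient(idx, q, scale, shape):
    """Rebuild a dense (sparse-approximated) gradient on the receiver side."""
    flat = torch.zeros(int(torch.tensor(shape).prod()), dtype=torch.float32)
    flat[idx] = q.to(torch.float32) * scale
    return flat.view(shape)

# Usage: the receiver reconstructs an approximation of the original gradient.
g = torch.randn(1024, 1024)
payload = compress_gradient(g, k_ratio=0.01)
g_hat = decompress_gradient(*payload, g.shape)

In this sketch, the per-iteration traffic drops roughly in proportion to k_ratio (plus index and scale overhead); practical systems typically pair such compression with error feedback to preserve convergence, which is omitted here for brevity.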