RELATED WORK

Communication optimization for distributed training. Generally speaking, a wide range of approaches can be taken to reduce the communication overhead of distributed training. These include, but are not limited to: 1) using large mini-batches [16] and periodic communication [42] to reduce the number of communication rounds; 2) using gradient compression techniques, e.g., gradient sparsification [34] and quantization [3], to reduce the traffic volume in each iteration; 3) relaxing the synchronization requirement [22], [23], [46]; 4) taking the intra-machine GPU topology into consideration [27]; 5) designing parameter exchange schemes based on the network topology [44]; 6) overlapping communication with computation [25], [38], [49]; 7) leveraging advanced communication libraries, e.g., ZMQ [21] and NCCL [26]; 8) exploiting fast network protocols, e.g., RDMA [17], [48]; 9) performing in-network aggregation to reduce the in-network traffic volume [6], [30], [39]; and 10) minimizing network flow completion time via congestion control [7], flow scheduling [4], [33], or coflow scheduling [12], [43], [50], [51]. We note that, while some of these methods have already been integrated into distributed DNN training systems, others remain to be explored.
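To make one of these approaches concrete, the following is a minimal sketch of gradient compression (approach 2), combining top-k sparsification with uniform 8-bit quantization of the surviving values. It assumes PyTorch; the function names, the k_ratio parameter, and the payload layout are illustrative and are not taken from the cited works.

import torch

def compress_gradient(grad: torch.Tensor, k_ratio: float = 0.01):
    """Keep the k largest-magnitude entries, then quantize them to 8 bits."""
    flat = grad.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    # Sparsification: indices of the top-k entries by magnitude.
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    # Uniform quantization of the surviving values into int8 with one scale.
    scale = values.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp((values / scale).round(), -127, 127).to(torch.int8)
    # (idx, q, scale) is the compressed payload that would be sent over the
    # network instead of the full dense gradient.
    return idx, q, scale

def decompress_gradient(idx, q, scale, shape):
    """Rebuild a dense (sparse-approximated) gradient on the receiver side."""
    flat = torch.zeros(int(torch.tensor(shape).prod()), dtype=torch.float32)
    flat[idx] = q.to(torch.float32) * scale
    return flat.view(shape)

# Usage: the receiver reconstructs an approximation of the original gradient.
g = torch.randn(1024, 1024)
payload = compress_gradient(g, k_ratio=0.01)
g_hat = decompress_gradient(*payload, g.shape)

In this sketch, the per-iteration traffic drops roughly in proportion to k_ratio (plus index and scale overhead); practical systems typically pair such compression with error feedback to preserve convergence, which is omitted here for brevity.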