A Distributed PTX Virtual Machine on Hybrid CPU/GPU Clusters
Published: 2016
DOI: 10.1016/j.sysarc.2015.10.003

Cited by 9 publications (6 citation statements); References 13 publications.
“…strengthening BSP [3], [5]–[11], [14], [16]–[18], the other is algorithmic optimizations replacing BSP [12], [13], [19]–[21]. However, neither of them can achieve the aforementioned two goals simultaneously in production clusters.…”
Section: B. Existing Solutions and Their Drawbacks (mentioning)
confidence: 99%
“…1) System-level Optimization: Due to its global synchronization nature, BSP can hardly eliminate the idle waiting of regular workers and links. Some system-level works employ various topologies like PS [5], [16], Ring [10], and double tree [14] to fully use network bandwidth. Some others explore different underlying network optimizations, including overlapping communication and computation [6]–[9], [17], [18], RDMA [22], [23], in-network aggregation [24]–[26], congestion control [27], [28], flow scheduling [29]–[31], and coflow scheduling [32]–[34].…”
Section: B. Existing Solutions and Their Drawbacks (mentioning)
confidence: 99%
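One of the topologies named in the statement above, the ring, underlies ring all-reduce. Below is a minimal single-process simulation of that schedule, assuming n workers that each split their gradient into n chunks and pass partial sums around the ring; it is a sketch of the technique only, not code from any of the cited systems.

```python
# Single-process simulation of ring all-reduce (illustrative sketch only).
# Each of n workers owns a gradient split into n chunks; sums circulate in
# 2*(n-1) steps: n-1 reduce-scatter steps, then n-1 all-gather steps.
import numpy as np

def ring_allreduce(grads):
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i owns the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for src, c, data in sends:
            chunks[(src + 1) % n][c] += data      # right neighbour accumulates the partial sum

    # All-gather: circulate the completed chunks so every worker ends up with every sum.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for src, c, data in sends:
            chunks[(src + 1) % n][c] = data       # right neighbour overwrites with the final sum

    return [np.concatenate(chunks[i]) for i in range(n)]

# Example: three workers, each holding a 6-element gradient.
if __name__ == "__main__":
    grads = [np.arange(6) * (w + 1) for w in range(3)]
    results = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in results)
```

Each worker only talks to its right neighbour and sends a total of about 2·(n−1)/n times the gradient size, which is why the ring keeps all links busy instead of funnelling traffic through a single node.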
“…The layer-wise structure of deep learning training makes it convenient to parallelize the communication tasks and the computation tasks. Communication scheduling strategies mainly target minimizing the network communication time [18], [25], [27], [28], [31], [35], [38], [45]–[47], [49]. Poseidon [49] supports overlapping the communication process with backward propagation, reducing bursty network communication.…”
Section: Communication Scheduling (mentioning)
confidence: 99%
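The overlap described in the statement above can be sketched as a producer/consumer pattern: the backward pass hands each layer's gradient to a communication thread as soon as it is computed, so transferring one layer overlaps with computing the next. This is a minimal sketch with hypothetical layer names and a stand-in send_gradient function; it is not Poseidon's actual implementation.

```python
# Sketch of overlapping gradient communication with backward propagation
# (illustrative; send_gradient stands in for a real all-reduce or PS push).
import queue
import threading
import time

def send_gradient(layer, grad):
    time.sleep(0.05)                         # stand-in for network transfer
    print(f"communicated gradient of {layer}")

def comm_worker(q):
    while True:
        item = q.get()
        if item is None:                     # sentinel: backward pass finished
            break
        send_gradient(*item)

def backward_step(layers, q):
    # Backward propagation visits layers in reverse order; a layer's gradient
    # is enqueued the moment it is ready, so its transfer overlaps with the
    # computation of the remaining layers.
    for layer in reversed(layers):
        time.sleep(0.03)                     # stand-in for computing this layer's gradient
        q.put((layer, f"grad({layer})"))
    q.put(None)

if __name__ == "__main__":
    q = queue.Queue()
    t = threading.Thread(target=comm_worker, args=(q,))
    t.start()
    backward_step(["conv1", "conv2", "fc1", "fc2"], q)
    t.join()                                 # all gradients sent before the next iteration
```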
“…RELATED WORK Communication optimization for distributed training. Generally speaking, there is a wide range of approaches we can take. These include, but are not limited to: 1) using large mini-batches [16] and periodic communication [42] to reduce the number of communication rounds; 2) using gradient compression techniques, e.g., gradient sparsification [34] and quantization [3], to reduce the traffic volume in each iteration; 3) relaxing the synchronization requirement [22], [23], [46]; 4) taking the intra-machine GPU topology into consideration [27]; 5) designing a parameter exchanging scheme that considers the network topology [44]; 6) overlapping communication with computation [25], [38], [49]; 7) leveraging advanced communication libraries, e.g., ZMQ [21] and NCCL [26]; 8) exploiting fast network protocols, e.g., RDMA [17], [48]; 9) performing in-network aggregation to reduce the in-network traffic volume [6], [30], [39]; 10) minimizing network flow completion time by using congestion control [7], flow scheduling [4], [33], or coflow scheduling [12], [43], [50], [51]. We note that, while some of these methods have already been integrated into distributed DNN training systems, others remain to be explored in the future.…”
Section: F. Reconfiguration Overhead and Speed (mentioning)
confidence: 99%
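As a concrete instance of the gradient compression techniques in item 2 of the list above, here is a minimal sketch of top-k gradient sparsification with error feedback. The function name, the fixed k, and the residual policy are assumptions made for illustration; the cited systems differ in these details.

```python
# Sketch of top-k gradient sparsification with error feedback (illustrative).
import numpy as np

def sparsify_topk(grad, residual, k):
    """Select the k largest-magnitude entries to transmit; keep the rest as residual."""
    acc = grad + residual                        # fold in error feedback from earlier steps
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # indices of the k largest magnitudes
    values = acc[idx]
    new_residual = acc.copy()
    new_residual[idx] = 0.0                      # transmitted entries leave the residual
    return idx, values, new_residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.normal(size=1000)
    residual = np.zeros_like(grad)
    idx, values, residual = sparsify_topk(grad, residual, k=10)
    print(f"transmitting {idx.size} of {grad.size} gradient entries")
```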
“…The data and code transmitted between the API library and the daemon are all encapsulated into messages. One message mainly contains two fields: a FunctionID and PTX (Parallel Thread eXecution) code [20–22]. The FunctionID field of the header indicates which GPU call is made; the parameters of this call and the PTX code are encapsulated in the message body.…”
Section: Adaptive and Transparent Task Scheduling (mentioning)
confidence: 99%
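A rough sketch of the encapsulation described above: a header carrying the FunctionID, followed by a body carrying the call parameters and the PTX code. The concrete field widths, byte order, and parameter encoding here are assumptions made for illustration, not the paper's actual wire format.

```python
# Illustrative message layout: header = (FunctionID, #params, PTX length),
# body = parameters followed by PTX code. Field sizes are assumptions.
import struct

HEADER = struct.Struct("!IHI")   # FunctionID, number of params, PTX length

def pack_message(function_id, params, ptx_code):
    """Encapsulate one GPU call for transmission from the API library to the daemon."""
    body = b"".join(struct.pack("!Q", p) for p in params)   # assume 64-bit scalar parameters
    ptx = ptx_code.encode("ascii")
    return HEADER.pack(function_id, len(params), len(ptx)) + body + ptx

def unpack_message(msg):
    function_id, n_params, ptx_len = HEADER.unpack_from(msg)
    off = HEADER.size
    params = [struct.unpack_from("!Q", msg, off + 8 * i)[0] for i in range(n_params)]
    off += 8 * n_params
    ptx_code = msg[off:off + ptx_len].decode("ascii")
    return function_id, params, ptx_code

if __name__ == "__main__":
    msg = pack_message(42, [1024, 256], ".version 7.0\n.target sm_50\n// kernel body ...")
    print(unpack_message(msg))
```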