2018
DOI: 10.1007/s11227-018-2375-9

A hybrid GPU cluster and volunteer computing platform for scalable deep learning

Cited by 14 publications (9 citation statements) · References 32 publications

“…In this section, we address Byzantine-tolerant training in a setup where new participants can join or leave the collaboration midway through training. This requirement arises naturally if a given training run relies on volunteers or an open pool of paid participants [13,14,15]. In addition to all existing concerns from Section 3, this new setup allows Byzantine attackers to assume a new identity each time they are blocked.…”
Section: G Reputation System For Public Collaborations
Mentioning (confidence: 99%)
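
The passage above argues that open participation defeats a naive ban list: a blocked attacker simply rejoins under a fresh identity. A minimal Python sketch of the mitigation the quote motivates, where new identities must earn reputation before their updates are trusted; all names (`PeerReputation`, `report_byzantine`, `is_trusted`) are hypothetical illustrations, not the cited paper's actual design:

```python
# Hypothetical sketch: reputation tracking for an open training pool.
from dataclasses import dataclass, field

@dataclass
class PeerReputation:
    scores: dict = field(default_factory=dict)   # peer_id -> reputation score
    banned: set = field(default_factory=set)     # peers blocked for Byzantine behavior

    def report_valid_update(self, peer_id: str) -> None:
        # Reward a gradient update that passed verification.
        self.scores[peer_id] = self.scores.get(peer_id, 0) + 1

    def report_byzantine(self, peer_id: str) -> None:
        # Block a peer caught sending a corrupted update.
        self.banned.add(peer_id)

    def is_trusted(self, peer_id: str, threshold: int = 3) -> bool:
        # A banned attacker who rejoins under a fresh peer_id starts at
        # score 0, so it must re-earn trust before its updates carry weight.
        return peer_id not in self.banned and self.scores.get(peer_id, 0) >= threshold
```
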
“…The first challenge is the sheer computational complexity of many machine learning tasks, such as pretraining transformers for NLP [7,8,9] or learning on huge datasets in vision [10,11,12]. Recent works propose several systems [13,14,15] that can share the computation across many volunteers who donate the idle time of their computers. Another challenge arises in Federated Learning, where participants train a shared model over decentralized data that cannot be shared for privacy reasons [16,17,18].…”
Section: Introduction
Mentioning (confidence: 99%)
“…By contrast, distributed training of a single model requires significantly more communication and does not allow a natural way to "restart" failed jobs. When it comes to distributed training of neural networks, most volunteer computing projects rely on parameter server architectures [71,72,73]. As a result, these systems are bounded by the throughput of parameter servers and the memory available on the weakest GPU.…”
Section: Volunteer Computing
Mentioning (confidence: 99%)
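
As a rough illustration of the bottleneck this quote describes, here is a toy parameter-server loop in Python; the class names and the trivial gradient are illustrative assumptions, not the architecture of any cited system:

```python
# Minimal sketch of the parameter-server pattern the quote describes.
import numpy as np

class ParameterServer:
    """Holds the full model; every worker pushes gradients to and pulls
    weights from this one node, so aggregate training bandwidth is bounded
    by its throughput."""
    def __init__(self, dim: int):
        self.weights = np.zeros(dim)

    def push(self, grad: np.ndarray, lr: float = 0.01) -> None:
        self.weights -= lr * grad   # apply one worker's gradient

    def pull(self) -> np.ndarray:
        return self.weights.copy()  # worker fetches the latest weights

def worker_step(server: ParameterServer, batch: np.ndarray) -> None:
    w = server.pull()                 # download current weights
    grad = 2 * (w - batch.mean(0))    # toy gradient for illustration only
    server.push(grad)                 # upload gradient to the server

# Each of N workers exchanges O(model size) data with the same server every
# step, which is exactly the throughput bound the quoted passage points out.
server = ParameterServer(dim=4)
for _ in range(3):
    worker_step(server, np.random.randn(8, 4))
```
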
“…For the NVIDIA Tesla V100 Volta GPU used in this study, peak single-precision and double-precision floating-point throughput reaches 14 and 7 TFLOP/s, respectively, far greater than the computing ability of a CPU. The GPU is widely used in general-purpose computing areas such as molecular dynamics (MD) [12], direct simulation Monte Carlo (DSMC) [13], CFD, artificial intelligence (AI) [14], and deep learning (DL) [15]. Performing large-scale numerical calculations of CFD on GPUs is a research focus in the field of general-purpose computing, and a series of important results have been achieved.…”
Section: Introduction
Mentioning (confidence: 99%)
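
For context on the quoted figures, the peak numbers follow directly from the V100's published specifications (5120 FP32 CUDA cores, roughly a 1.38 GHz boost clock on the SXM2 part, FP64 at half the FP32 rate). A quick back-of-envelope check in Python:

```python
# Peak FLOPS = cores x FLOPs-per-core-per-cycle x clock.
# A fused multiply-add (FMA) counts as 2 floating-point operations.
cuda_cores = 5120
boost_clock_hz = 1.38e9
flops_per_fma = 2

fp32_peak = cuda_cores * flops_per_fma * boost_clock_hz   # ~14.1e12
fp64_peak = fp32_peak / 2                                 # ~7.1e12

print(f"FP32 peak: {fp32_peak / 1e12:.1f} TFLOP/s")  # FP32 peak: 14.1 TFLOP/s
print(f"FP64 peak: {fp64_peak / 1e12:.1f} TFLOP/s")  # FP64 peak: 7.1 TFLOP/s
```

This reproduces the 14 and 7 TFLOP/s cited in the statement above.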