Proceedings of the 11th ACM Symposium on Cloud Computing 2020
DOI: 10.1145/3419111.3421296

Network-accelerated distributed machine learning for multi-tenant settings

Abstract: Many distributed machine learning (DML) workloads are increasingly being run in shared clusters. Training in such clusters can be impeded by unexpected compute and network contention, resulting in stragglers. We present MLfabric, a contention-aware DML system that manages the performance of a DML job running in a shared cluster. The DML application hands all network communication (gradient and model transfers) to the MLfabric communication library. MLfabric then carefully orders transfers to improve convergence…
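
The abstract only outlines the design at a high level. As a rough illustration of the core idea, handing gradient transfers to a communication library that releases them under a contention-aware ordering policy, a minimal sketch follows; all names (CommLibrary, submit, drain) and the smallest-transfer-first policy are hypothetical placeholders, not MLfabric's actual API or scheduling algorithm.

```python
import heapq


class CommLibrary:
    """Hypothetical stand-in for a contention-aware communication layer.

    Workers enqueue gradient transfers instead of sending them directly;
    the library then releases them in an order chosen by a pluggable
    policy (here: fewest-bytes-first, a crude proxy for finishing cheap
    transfers before contention hits them).
    """

    def __init__(self):
        self._queue = []   # entries: (priority, seq, worker_id, payload)
        self._seq = 0      # tie-breaker so heap comparisons never reach payloads

    def submit(self, worker_id, gradient_bytes):
        # Lower priority value means released earlier.
        priority = len(gradient_bytes)
        heapq.heappush(self._queue, (priority, self._seq, worker_id, gradient_bytes))
        self._seq += 1

    def drain(self):
        """Yield transfers in the order the policy chose."""
        while self._queue:
            _, _, worker_id, payload = heapq.heappop(self._queue)
            yield worker_id, payload


if __name__ == "__main__":
    lib = CommLibrary()
    lib.submit("worker-0", b"\x00" * 4096)   # large gradient
    lib.submit("worker-1", b"\x00" * 512)    # small gradient
    for wid, grad in lib.drain():
        print(wid, len(grad), "bytes released")
```

A real contention-aware scheduler would of course use richer signals than payload size; the point here is only the interposition pattern of routing all transfers through one library that decides their order.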

Cited by 14 publications (4 citation statements). References 19 publications.
“…Specifically for such ML tasks, network performance has been noted as a major bottleneck hindering the efficient usage of such frameworks [3], [35]. Various approaches have been suggested to modify ML methodologies in order to improve upon the network induced performance of distributed ML [4], [36], [37].…”
Section: Related Work (mentioning; confidence: 99%)

“…As online applications and services increase in popularity, distributed data processing capabilities and datacenter networks have become a major part of the infrastructure of modern society. Moreover, due to the vast growth in the amount of data processed by such applications, recent work shows that the bottleneck for efficient distributed computation is now the underlying communication network and not the computational capabilities at the servers [1]-[3], as was traditionally the case.…”
Section: Introduction (mentioning; confidence: 99%)

“…Efficiently performing distributed machine learning, and specifically the task of training deep neural networks, has been a fundamental concern in the past decade. In particular, network bottlenecks are arguably one of the major concerns when executing such tasks [29,53]. Various methods for improving network performance and footprint in such systems have been proposed and implemented, including sparsification, quantization, and scheduling [17,54,58].…”
Section: Related Work (mentioning; confidence: 99%)

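As a generic illustration of one technique named in the excerpt above, top-k gradient sparsification (a textbook sketch, not the scheme of any particular cited system), consider:

```python
import numpy as np


def topk_sparsify(gradient, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Returns (indices, values); only these need to be sent over the
    network, which is the point of sparsification."""
    idx = np.argpartition(np.abs(gradient), -k)[-k:]
    return idx, gradient[idx]


if __name__ == "__main__":
    g = np.random.randn(1_000_000).astype(np.float32)
    idx, vals = topk_sparsify(g, k=1000)
    # Roughly 0.1% of the original entries are transmitted.
    print(idx.shape, vals.shape)
```

Only the selected indices and values cross the network, which is what shrinks the communication footprint; quantization instead reduces the number of bits used to represent each transmitted value.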
“…Datacenter networks and their distributed data processing capabilities are the driving force behind leading applications and services, including search engines, content distribution, social networks and eCommerce. Recent work has shown that for many of the tasks performed by such applications, the network (and not server computation) is the actual bottleneck hindering the ability to optimize computation efficiency and performance [13,35,53]. Such bottlenecks occur, e.g., in distributed and federated machine learning (e.g., AllReduce), and in solutions employing the MapReduce methodology for big data tasks, and more generally in scenarios giving rise to the incast problem [5,56].…”
Section: Introduction (mentioning; confidence: 99%)
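
The AllReduce pattern mentioned in this excerpt can be made concrete with a small in-memory simulation of a ring all-reduce; this is purely illustrative and assumes nothing about how the cited systems implement it (production stacks use NCCL, MPI, or framework-native collectives).

```python
import numpy as np


def ring_allreduce(tensors):
    """Simulated ring all-reduce over n in-memory 'workers'.

    Each worker starts with its own tensor and ends with the element-wise
    sum of all tensors, exchanging only one 1/n-sized segment per step
    (2*(n-1) steps in total), which is what makes the pattern
    bandwidth-efficient."""
    n = len(tensors)
    # Each worker's buffer, split into n segments.
    segs = [np.array_split(t.astype(np.float64), n) for t in tensors]

    # Phase 1: reduce-scatter. After n-1 steps, worker i owns the fully
    # summed segment (i + 1) % n.
    for step in range(n - 1):
        outgoing = [(i, (i - step) % n, segs[i][(i - step) % n].copy())
                    for i in range(n)]
        for i, seg_id, payload in outgoing:
            segs[(i + 1) % n][seg_id] += payload

    # Phase 2: all-gather. Each worker forwards the segment it just
    # completed until everyone holds every summed segment.
    for step in range(n - 1):
        outgoing = [(i, (i + 1 - step) % n, segs[i][(i + 1 - step) % n].copy())
                    for i in range(n)]
        for i, seg_id, payload in outgoing:
            segs[(i + 1) % n][seg_id] = payload

    return [np.concatenate(s) for s in segs]


if __name__ == "__main__":
    grads = [np.full(8, fill_value=w, dtype=np.float32) for w in range(4)]
    out = ring_allreduce(grads)
    # Every worker should now hold the sum 0 + 1 + 2 + 3 = 6 everywhere.
    assert all(np.allclose(o, 6.0) for o in out)
    print(out[0])
```

Each worker sends only 2*(n-1) segment-sized messages regardless of model size, but because many such flows traverse the same links at the same time, this traffic is exactly the kind that exposes the network bottlenecks and incast effects the excerpt describes.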