“…DML algorithms, e.g., stochastic gradient descent (SGD) and Latent Dirichlet Allocation (LDA) [10,32], are iterative in nature and are both computation- and communication-intensive (§2). Over the years, a variety of DML systems [1,6,7,17] have been developed to accelerate training by improving worker computation, e.g., via hardware accelerators [23,38], improving communication efficiency [2,3,42,48], and coordinating computation with communication [24,39,49,55]. These systems generally use one of two architectures: (a) Parameter Server (PS) [12,50], where the model is stored at a separate location (the server); in every iteration, workers pull the latest model and compute an update, which is then shipped to the server and applied to the model.…”
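
To make the pull/compute/push cycle of the PS architecture concrete, the following is a minimal single-process Python sketch. The names (ParameterServer, Worker, pull, push), the least-squares objective, and the learning rate are illustrative assumptions rather than the API of any cited system; the worker-server network round-trips are simulated with direct method calls, and workers run sequentially here purely to keep the example self-contained.

import random

class ParameterServer:
    """Holds the shared model; stands in for the server role (assumed name)."""
    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim  # the model lives at the server
        self.lr = lr

    def pull(self):
        # Workers fetch the latest model at the start of each iteration.
        return list(self.weights)

    def push(self, grad):
        # Apply the shipped update to the model (an SGD step).
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]

class Worker:
    """Computes updates on a local data shard (assumed name)."""
    def __init__(self, shard):
        self.shard = shard  # list of (features, label) pairs

    def compute_update(self, weights):
        # Mean-squared-error gradient over this worker's local shard.
        grad = [0.0] * len(weights)
        for x, y in self.shard:
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(self.shard)
        return grad

# Synthetic task: recover true_w from linear measurements (illustrative).
random.seed(0)
true_w = [3.0, -2.0]
def sample():
    x = [random.uniform(-1.0, 1.0) for _ in true_w]
    return x, sum(w * xi for w, xi in zip(true_w, x))

server = ParameterServer(dim=len(true_w))
workers = [Worker([sample() for _ in range(50)]) for _ in range(4)]

for step in range(100):
    for worker in workers:          # per iteration: pull -> compute -> push
        model = server.pull()
        server.push(worker.compute_update(model))

print(server.weights)  # approaches true_w = [3.0, -2.0]

In an actual PS deployment, pull and push are network RPCs issued by many workers in parallel, which is what makes the architecture communication-intensive and motivates the consistency and scheduling optimizations cited above.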