In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. Each worker samples a local stochastic gradient in parallel, a single server aggregates all gradients to obtain their average, and each worker's local model is updated via an SGD step with the averaged gradient.1 Ideally, parallel mini-batch SGD achieves a linear speedup of the training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, model averaging significantly reduces the communication overhead. A large body of experimental work has verified that model averaging can still achieve a good speedup of the training time as long as the averaging interval is carefully controlled. However, why such a simple heuristic works so well has remained theoretically unexplained. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

1 Equivalently, we can let the server update its solution using the averaged gradient and broadcast this solution to all local workers. Another equivalent implementation is to let each worker take a single SGD step using its own gradient and send the updated local solution to the server; the server then calculates the average of all workers' updated solutions and refreshes each worker's local solution with the averaged version.
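To make the contrast concrete, here is a minimal sketch in Python/NumPy (the quadratic loss, learning rate, and all function names are illustrative assumptions, not the paper's setup): parallel mini-batch SGD communicates gradients every iteration, while model averaging communicates models only every avg_interval iterations.

import numpy as np

def noisy_grad(x, rng):
    # Illustrative stochastic gradient of f(x) = 0.5*||x||^2 (not from the paper).
    return x + 0.1 * rng.standard_normal(x.shape)

def parallel_minibatch_sgd(num_workers, steps, lr=0.1, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.ones(dim)                                    # single shared model
    for _ in range(steps):
        grads = [noisy_grad(x, rng) for _ in range(num_workers)]
        x -= lr * np.mean(grads, axis=0)                # gradients communicated every iteration
    return x

def model_averaging(num_workers, steps, avg_interval, lr=0.1, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    xs = [np.ones(dim) for _ in range(num_workers)]     # one local model per worker
    for t in range(1, steps + 1):
        xs = [x - lr * noisy_grad(x, rng) for x in xs]  # independent local SGD steps
        if t % avg_interval == 0:                       # models communicated only here
            avg = np.mean(xs, axis=0)
            xs = [avg.copy() for _ in xs]
    return np.mean(xs, axis=0)

Note that with avg_interval = 1 the model-averaging update coincides with parallel mini-batch SGD; larger intervals trade communication for potential drift between local models.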
Abstract. This paper considers convex programs with a general (possibly non-differentiable) convex objective function and Lipschitz continuous convex inequality constraint functions. A simple algorithm is developed that achieves an O(1/t) convergence rate. Like the classical dual subgradient algorithm and the ADMM algorithm, the new algorithm admits a parallel implementation when the objective and constraint functions are separable. However, the new algorithm has a faster O(1/t) convergence rate compared with the best known O(1/√t) convergence rate for the dual subgradient algorithm with primal averaging. Further, it can solve convex programs with nonlinear constraints, which cannot be handled by the ADMM algorithm. The new algorithm is applied to a multipath network utility maximization problem and yields a decentralized flow control algorithm with the fast O(1/t) convergence rate.
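For context, the dual subgradient baseline referenced above has a standard form; the following is a sketch in my own notation (step size γ, m constraints), not an excerpt from the paper. For the program min_{x∈X} f(x) subject to g_k(x) ≤ 0, each iteration updates

\[
\begin{aligned}
x(t) &\in \arg\min_{x \in \mathcal{X}} \Big\{ f(x) + \sum_{k=1}^{m} \lambda_k(t)\, g_k(x) \Big\}, \\
\lambda_k(t+1) &= \max\big\{ \lambda_k(t) + \gamma\, g_k(x(t)),\ 0 \big\}, \qquad k = 1, \dots, m,
\end{aligned}
\]

and the primal average \(\bar{x}(T) = \frac{1}{T} \sum_{t=0}^{T-1} x(t)\) is reported as the solution; it is this averaged iterate whose best known rate is O(1/√t).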
Recent developments in large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works studies the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out computing capability by adding more computing nodes to the system, and reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods have been increasingly adopted for training machine learning models, as they often converge faster and generalize better; for example, many practitioners use distributed SGD with momentum to train deep neural networks on big data. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD while retaining reduced communication complexity. This paper fills this gap by considering a distributed communication-efficient momentum SGD method and proving its linear speedup property.
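As a rough illustration of the kind of method in question (a hedged sketch, not necessarily the exact algorithm analyzed in the paper): each worker runs local heavy-ball momentum SGD, and the models, together with the momentum buffers here, are averaged every avg_interval iterations. The loss and all names are illustrative.

import numpy as np

def distributed_momentum_sgd(num_workers, steps, avg_interval,
                             lr=0.05, beta=0.9, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    xs = [np.ones(dim) for _ in range(num_workers)]     # local models
    vs = [np.zeros(dim) for _ in range(num_workers)]    # local momentum buffers
    for t in range(1, steps + 1):
        for i in range(num_workers):
            g = xs[i] + 0.1 * rng.standard_normal(dim)  # illustrative noisy gradient
            vs[i] = beta * vs[i] + g                    # heavy-ball momentum update
            xs[i] = xs[i] - lr * vs[i]
        if t % avg_interval == 0:                       # periodic communication round
            x_avg = np.mean(xs, axis=0)
            v_avg = np.mean(vs, axis=0)
            xs = [x_avg.copy() for _ in xs]
            vs = [v_avg.copy() for _ in vs]
    return np.mean(xs, axis=0)

Whether the momentum buffers should also be averaged (or instead reset) at communication rounds is itself a design choice; the sketch averages them for symmetry.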
Suppose one algorithm achieves an O(1/√T) convergence rate while a second achieves an O(1/√(NT)) convergence rate, where N is the number of workers and T the number of iterations. Then the second algorithm takes O(1/(Nε²)) iterations, which is N times smaller than the O(1/ε²) iterations required by the first, to attain an O(ε)-accurate solution. In this sense, the second algorithm is N times faster than the first one.
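The iteration counts follow from inverting the two rates: requiring the error bound to fall below ε gives

\[
\frac{1}{\sqrt{NT}} \le \epsilon \;\Longleftrightarrow\; T \ge \frac{1}{N\epsilon^2},
\qquad
\frac{1}{\sqrt{T}} \le \epsilon \;\Longleftrightarrow\; T \ge \frac{1}{\epsilon^2}.
\]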
This paper considers online convex optimization with time-varying constraint functions. Specifically, we have a sequence of convex objective functions {f_t(x)}_{t=0}^∞ and convex constraint functions {g_{t,i}(x)}_{t=0}^∞ for i ∈ {1, ..., k}. The functions are gradually revealed over time. For a given ε > 0, the goal is to choose a point x_t at each step t, without knowing the f_t and g_{t,i} functions of that step, to achieve a time average at most ε worse than the best fixed decision that could be chosen with hindsight, subject to the time average of the constraint functions being nonpositive. It is known that this goal is generally impossible. This paper develops an online algorithm that solves the problem with O(1/ε²) convergence time in the special case when all constraint functions are nonpositive over a common subset of R^n. Similar performance is shown in an expected sense when the common-subset assumption is removed but the constraint functions are assumed to vary according to a random process that is independent and identically distributed (i.i.d.) over time slots t ∈ {0, 1, 2, ...}. Finally, in the special case when both the constraint and objective functions are i.i.d. over time slots t, the algorithm is shown to come within ε of optimality with respect to the best (possibly time-varying) causal policy that knows the full probability distribution.
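To fix ideas, the interaction protocol (not the paper's algorithm; the decision rule, names, and bookkeeping below are illustrative placeholders) can be sketched as follows: the learner commits to x_t before seeing that step's functions, and performance is judged by time-averaged regret against a fixed comparator x_star and by time-averaged constraint values.

import numpy as np

def run_protocol(decide, fs, gs, x_star, T):
    # fs[t] is the objective f_t; gs[t] is the list of constraints [g_{t,1}, ..., g_{t,k}].
    regret_sum = 0.0
    constraint_sum = np.zeros(len(gs[0]))
    history = []
    for t in range(T):
        x_t = decide(history)                  # chosen without seeing fs[t], gs[t]
        regret_sum += fs[t](x_t) - fs[t](x_star)
        constraint_sum += np.array([g(x_t) for g in gs[t]])
        history.append((fs[t], gs[t], x_t))    # revealed only after committing to x_t
    # Goal: average regret at most eps, average constraint values nonpositive.
    return regret_sum / T, constraint_sum / T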