2019
DOI: 10.1007/s10107-019-01440-w

Why random reshuffling beats stochastic gradient descent

Abstract: We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart […]
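
To make the sampling difference concrete, here is a minimal sketch, assuming simple quadratic components and a diminishing step size (both illustrative choices, not taken from the paper), of one epoch of random reshuffling versus one epoch of with-replacement SGD.

import numpy as np

def rr_epoch(x, grads, stepsize, rng):
    # Random reshuffling: draw a fresh uniformly random permutation each cycle,
    # so every component gradient is used exactly once (without replacement).
    for i in rng.permutation(len(grads)):
        x = x - stepsize * grads[i](x)
    return x

def sgd_epoch(x, grads, stepsize, rng):
    # With-replacement SGD: each of the m steps samples a component independently,
    # so some components may be used several times and others not at all.
    m = len(grads)
    for i in rng.integers(0, m, size=m):
        x = x - stepsize * grads[i](x)
    return x

# Illustrative finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, so grad f_i(x) = a_i * (a_i^T x - b_i).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grads = [lambda x, a=a_i, bi=b_i: a * (a @ x - bi) for a_i, b_i in zip(A, b)]

x_rr = np.zeros(5)
x_sgd = np.zeros(5)
for k in range(1, 101):            # diminishing step size 1/k, an assumed schedule
    x_rr = rr_epoch(x_rr, grads, 1.0 / k, rng)
    x_sgd = sgd_epoch(x_sgd, grads, 1.0 / k, rng)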

Cited by 79 publications (84 citation statements)
References: 32 publications
“…Therefore, we argue that the privacy accounting method has to be chosen according to which data batching method is used. In our implementation, we focus on random reshuffling, because it is a common practice [5], [24] followed by existing deep learning implementations and it is also numerically observed that random reshuffling outperforms random sampling with replacement in the convergence rate [25]. Note: In our previous conference version we set $P(q, \sigma) = q^2/\sigma^2$ directly based on the derivation from the log moment bound stated in the proof of Theorem 1 of [8], which misses an asymptotic term.…”
Section: B. Refined Privacy Accountant
confidence: 99%
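
For context on why the batching method matters here, the following is a small sketch (an illustration under assumed names, not the cited implementation) contrasting the two batching schemes the statement refers to: random reshuffling, where every example appears exactly once per epoch, and sampling with replacement, where each batch is an independent draw.

import numpy as np

def reshuffled_batches(n, batch_size, rng):
    # Random reshuffling: permute all n indices once per epoch, then slice into
    # consecutive batches; every example appears exactly once in the epoch.
    order = rng.permutation(n)
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]

def with_replacement_batches(n, batch_size, rng):
    # Sampling with replacement: each batch is drawn independently, so a given
    # example can appear in several batches (or in none) within one epoch.
    return [rng.integers(0, n, size=batch_size) for _ in range(n // batch_size)]

rng = np.random.default_rng(0)
print(reshuffled_batches(10, 3, rng))
print(with_replacement_batches(10, 3, rng))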
“…The scheme proposed in [3] uses a random uncoded storage (to fill users' extra memories independently when M > q) and a coded multicast transmission from the master to the workers, and yields a gain of a factor of O(K) in terms of communication load with respect to the naive scheme.² Notice that putting all nodes on the same bus (typical terminology in Computer Science) is very common and practically relevant, since this is what happens, for example, with Ethernet, or with the Peripheral Component Interconnect Express (PCI Express) bus inside a multi-core computer, where all cores share a common bus for intercommunication. The access of such a bus is regulated by some collision-avoidance protocol such as Carrier Sense Multiple Access (CSMA) [4] or Token Ring [5], so that only one node talks at a time and all others listen.…”
Section: A. Centralized Data Shuffling
confidence: 99%
“…To cope with such a large size/dimension of data and the complexity of machine learning algorithms, it is increasingly popular to use distributed computing platforms such as Amazon Web Services Cloud, Google Cloud, and Microsoft Azure services, where large scale distributed machine learning algorithms can be implemented. The approach of data shuffling has been identified as one of the core elements to improve the statistical performance of modern large scale machine learning algorithms [1], [2].…”
Section: Introduction
confidence: 99%
“…We also define $L_H := \sum_{i=1}^{m} L_{H,i}$ as the Lipschitz constant for the Hessian of the sum function $F(\theta)$. In the above, the first assumption can be satisfied when Line 4 of Algorithm 3.1 is implemented with either cyclic function selection, i.e., $i_k = (k \bmod m) + 1$, or a random shuffling at the beginning of every epoch [15]. The second and the last assumptions are standard and they can be satisfied by a number of functions relevant to machine learning applications, e.g., the logistic loss function.…”
Section: Convergence Analysis
confidence: 99%
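
The two index-selection rules this statement contrasts can be written out directly; the sketch below is illustrative (the 1-based indexing follows the quoted formula, everything else is an assumption) and shows cyclic selection $i_k = (k \bmod m) + 1$ next to a per-epoch random shuffle.

import numpy as np

def cyclic_indices(m, num_steps):
    # Cyclic selection: i_k = (k mod m) + 1, sweeping the components in a fixed order.
    return [(k % m) + 1 for k in range(num_steps)]

def reshuffled_indices(m, num_epochs, rng):
    # Random reshuffling: draw a fresh permutation at the start of every epoch.
    out = []
    for _ in range(num_epochs):
        out.extend(int(i) + 1 for i in rng.permutation(m))
    return out

rng = np.random.default_rng(0)
print(cyclic_indices(5, 10))          # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(reshuffled_indices(5, 2, rng))  # two random sweeps over {1, ..., 5}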