2019
DOI: 10.1007/s10107-019-01440-w

Why random reshuffling beats stochastic gradient descent

Abstract: We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart […]
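
To make the sampling difference concrete, here is a minimal sketch, assuming simple quadratic components and a diminishing step size (both illustrative choices, not taken from the paper), of one epoch of random reshuffling versus one epoch of with-replacement SGD.

import numpy as np

def rr_epoch(x, grads, stepsize, rng):
    # Random reshuffling: draw a fresh uniformly random permutation each cycle,
    # so every component gradient is used exactly once (without replacement).
    for i in rng.permutation(len(grads)):
        x = x - stepsize * grads[i](x)
    return x

def sgd_epoch(x, grads, stepsize, rng):
    # With-replacement SGD: each of the m steps samples a component independently,
    # so some components may be used several times and others not at all.
    m = len(grads)
    for i in rng.integers(0, m, size=m):
        x = x - stepsize * grads[i](x)
    return x

# Illustrative finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, so grad f_i(x) = a_i * (a_i^T x - b_i).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grads = [lambda x, a=a_i, bi=b_i: a * (a @ x - bi) for a_i, b_i in zip(A, b)]

x_rr = np.zeros(5)
x_sgd = np.zeros(5)
for k in range(1, 101):            # diminishing step size 1/k, an assumed schedule
    x_rr = rr_epoch(x_rr, grads, 1.0 / k, rng)
    x_sgd = sgd_epoch(x_sgd, grads, 1.0 / k, rng)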

Cited by 79 publications (84 citation statements)
References: 32 publications
“…Therefore, we argue that the privacy accounting method has to be chosen according to which data batching method is used. In our implementation, we focus on random reshuffling, because it is a common practice [5], [24] followed by existing deep learning implementations and it is also numerically observed that random reshuffling outperforms random sampling with replacement in the convergence rate [25]. Note: In our previous conference version we set $P(q, \sigma) = q^2/\sigma^2$ directly based on the derivation from the log moment bound stated in the proof of Theorem 1 of [8], which misses an asymptotic term.…”
Section: B. Refined Privacy Accountant
confidence: 99%
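
For context on why the batching method matters here, the following is a small sketch (an illustration under assumed names, not the cited implementation) contrasting the two batching schemes the statement refers to: random reshuffling, where every example appears exactly once per epoch, and sampling with replacement, where each batch is an independent draw.

import numpy as np

def reshuffled_batches(n, batch_size, rng):
    # Random reshuffling: permute all n indices once per epoch, then slice into
    # consecutive batches; every example appears exactly once in the epoch.
    order = rng.permutation(n)
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]

def with_replacement_batches(n, batch_size, rng):
    # Sampling with replacement: each batch is drawn independently, so a given
    # example can appear in several batches (or in none) within one epoch.
    return [rng.integers(0, n, size=batch_size) for _ in range(n // batch_size)]

rng = np.random.default_rng(0)
print(reshuffled_batches(10, 3, rng))
print(with_replacement_batches(10, 3, rng))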
“…The scheme proposed in [3] uses a random uncoded storage (to fill users' extra memories independently when M > q) and a coded multicast transmission from the master to the workers, and yields a gain of a factor of O(K) in terms of communication load with respect to the naive scheme.² Notice that putting all nodes on the same bus (typical terminology in Computer Science) is very common and practically relevant, since this is what happens, for example, with Ethernet, or with the Peripheral Component Interconnect Express (PCI Express) bus inside a multi-core computer, where all cores share a common bus for intercommunication. The access of such a bus is regulated by some collision-avoidance protocol such as Carrier Sense Multiple Access (CSMA) [4] or Token Ring [5], so that only one node talks at a time and all others listen.…”
Section: A. Centralized Data Shuffling
confidence: 99%
“…To cope with such a large size/dimension of data and the complexity of machine learning algorithms, it is increasingly popular to use distributed computing platforms such as Amazon Web Services Cloud, Google Cloud, and Microsoft Azure services, where large scale distributed machine learning algorithms can be implemented. The approach of data shuffling has been identified as one of the core elements to improve the statistical performance of modern large scale machine learning algorithms [1], [2].…”
Section: Introduction
confidence: 99%
“…We also define $L_H := \sum_{i=1}^{m} L_{H,i}$ as the Lipschitz constant for the Hessian of the sum function $F(\theta)$. In the above, the first assumption can be satisfied when Line 4 of Algorithm 3.1 is implemented with either cyclic function selection, i.e., $i_k = (k \bmod m) + 1$, or a random shuffling at the beginning of every epoch [15]. The second and the last assumptions are standard and they can be satisfied by a number of functions relevant to machine learning applications, e.g., the logistic loss function.…”
Section: Convergence Analysis
confidence: 99%
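
The two index-selection rules this statement contrasts can be written out directly; the sketch below is illustrative (the 1-based indexing follows the quoted formula, everything else is an assumption) and shows cyclic selection $i_k = (k \bmod m) + 1$ next to a per-epoch random shuffle.

import numpy as np

def cyclic_indices(m, num_steps):
    # Cyclic selection: i_k = (k mod m) + 1, sweeping the components in a fixed order.
    return [(k % m) + 1 for k in range(num_steps)]

def reshuffled_indices(m, num_epochs, rng):
    # Random reshuffling: draw a fresh permutation at the start of every epoch.
    out = []
    for _ in range(num_epochs):
        out.extend(int(i) + 1 for i in rng.permutation(m))
    return out

rng = np.random.default_rng(0)
print(cyclic_indices(5, 10))          # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(reshuffled_indices(5, 2, rng))  # two random sweeps over {1, ..., 5}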