2017
DOI: 10.1016/j.neunet.2017.06.003

Accelerating deep neural network training with inconsistent stochastic gradient descent

Abstract: Stochastic Gradient Descent (SGD) updates network parameters with a noisy gradient computed from a random batch, and each batch evenly updates the network once in an epoch. This model applies the same training effort to each batch, but it overlooks the fact that the gradient variance, induced by Sampling Bias and Intrinsic Image Difference, renders different training dynamics on batches. In this paper, we develop a new training strategy for SGD, referred to as Inconsistent Stochastic Gradient Descent (ISGD) to…
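The contrast the abstract draws can be made concrete with a short sketch. The snippet below is a hypothetical PyTorch illustration, not the paper's ISGD algorithm: each batch gets the usual single update, and a few extra updates go to batches whose loss exceeds a running mean. The running-mean threshold, the `extra_steps` cap, and the function arguments are assumptions made for illustration only.

```python
# Hypothetical sketch of "inconsistent" training effort: plain SGD gives every
# batch exactly one update per epoch, whereas this policy spends extra updates
# on batches whose loss suggests they are under-trained. The threshold rule is
# an illustrative assumption, not the paper's exact control mechanism.
import torch

def train_epoch_inconsistent(model, loader, criterion, optimizer, extra_steps=3):
    running_loss, n_batches = 0.0, 0
    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # the usual single update per batch
        running_loss += loss.item()
        n_batches += 1
        threshold = running_loss / n_batches  # crude "under-trained" signal
        steps = 0
        # Inconsistent part: keep updating on this batch while its loss stays
        # above the running mean, up to a small cap.
        while loss.item() > threshold and steps < extra_steps:
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            steps += 1
```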

Cited by 82 publications (42 citation statements)
References 21 publications (28 reference statements)
“…In contrast to BSP, ASP has the least training time for a given number of epochs, but usually yields much lower accuracy because it omits the synchronization step among workers. Moreover, little or no synchronization among workers tends to destabilize the convergence of a DNN model [22]. Hence, ASP is not stable in terms of model convergence.…”
Section: A. Distributed Paradigms For Updating the Parameters
confidence: 99%
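To make the BSP/ASP distinction in this excerpt concrete, the toy snippet below simulates both update rules in a single process. It is a schematic sketch under simplifying assumptions, not a real distributed implementation: the worker "gradients" are random stand-ins, and the staleness that makes ASP unstable is only hinted at in the comments.

```python
# Toy, single-process comparison of BSP vs ASP parameter updates. Worker
# gradients are random placeholders; a real system would compute them on
# separate machines and, for ASP, on stale copies of the parameters.
import torch

def bsp_step(params, worker_grads, lr=0.1):
    # Bulk-synchronous: wait for every worker, average the gradients,
    # then apply a single update.
    avg = torch.stack(worker_grads).mean(dim=0)
    return params - lr * avg

def asp_step(params, worker_grads, lr=0.1):
    # Asynchronous: apply each worker's gradient as soon as it arrives, so
    # later workers effectively act on parameters that others already moved.
    for g in worker_grads:
        params = params - lr * g
    return params

params = torch.zeros(4)
grads = [torch.randn(4) for _ in range(3)]   # pretend three workers
print(bsp_step(params, grads))
print(asp_step(params, grads))
```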
“…Algorithms like Stochastic Variance Reduced Gradient (SVRG) method [5] and related approaches [6] mix SGD-like steps with some batch computations to control the stochastic noise. Others have proposed to parallelize stochastic training through large mini-batches [7].…”
Section: Analysis Of Related Research
confidence: 99%
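As a concrete reference for the variance-reduction idea mentioned in this excerpt, below is a minimal SVRG sketch on a least-squares toy problem: an occasional full-batch gradient at a snapshot is combined with per-sample gradients as a control variate. The problem size, step size, and loop lengths are illustrative assumptions.

```python
# Minimal SVRG sketch: mix SGD-like steps with occasional full-batch gradient
# computations to reduce stochastic noise. Least-squares objective for brevity.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)

def grad_i(w, i):
    # Gradient of one sample's squared-error loss.
    return (A[i] @ w - b[i]) * A[i]

def svrg(w, lr=0.01, outer=20, inner=200):
    for _ in range(outer):
        snapshot = w.copy()
        full_grad = A.T @ (A @ snapshot - b) / len(b)   # occasional batch pass
        for _ in range(inner):
            i = rng.integers(len(b))
            # SGD-like step with control variate: the two stochastic terms use
            # the same sample index i, so their noise largely cancels.
            w = w - lr * (grad_i(w, i) - grad_i(snapshot, i) + full_grad)
    return w

w_hat = svrg(np.zeros(10))
```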
“…In the experiment, we used the Stochastic Gradient Descent (SGD) algorithm [45] as the model update method. The SGD optimizer has several parameter settings: the initial learning rate is 0.001, and the learning rate is then decreased by ten percent every five epochs.…”
Section: Network Structure
confidence: 99%
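The schedule described in this excerpt maps onto a standard step decay. The sketch below uses PyTorch's SGD with StepLR; reading "decreased by ten percent" literally gives a decay factor of 0.9 every five epochs, though the cited authors may have meant a factor of 0.1 ("to ten percent"). The model and the loop body are placeholders.

```python
# Sketch of the learning-rate schedule described above: SGD starting at 0.001,
# decayed every five epochs. gamma=0.9 assumes the literal "by ten percent"
# reading; use gamma=0.1 if "to ten percent" was intended.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                       # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

for epoch in range(30):
    # ... one training epoch would run here: forward, loss, backward, step ...
    optimizer.step()       # placeholder for the per-batch parameter updates
    scheduler.step()       # decay the learning rate at the epoch boundary
```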