2017
DOI: 10.1016/j.neunet.2017.06.003

Accelerating deep neural network training with inconsistent stochastic gradient descent

Abstract: Stochastic Gradient Descent (SGD) updates network parameters with a noisy gradient computed from a random batch, and each batch evenly updates the network once in an epoch. This model applies the same training effort to each batch, but it overlooks the fact that the gradient variance, induced by Sampling Bias and Intrinsic Image Difference, renders different training dynamics on batches. In this paper, we develop a new training strategy for SGD, referred to as Inconsistent Stochastic Gradient Descent (ISGD) to…
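The contrast the abstract draws can be made concrete with a short sketch. The snippet below is a hypothetical PyTorch illustration, not the paper's ISGD algorithm: each batch gets the usual single update, and a few extra updates go to batches whose loss exceeds a running mean. The running-mean threshold, the `extra_steps` cap, and the function arguments are assumptions made for illustration only.

```python
# Hypothetical sketch of "inconsistent" training effort: plain SGD gives every
# batch exactly one update per epoch, whereas this policy spends extra updates
# on batches whose loss suggests they are under-trained. The threshold rule is
# an illustrative assumption, not the paper's exact control mechanism.
import torch

def train_epoch_inconsistent(model, loader, criterion, optimizer, extra_steps=3):
    running_loss, n_batches = 0.0, 0
    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # the usual single update per batch
        running_loss += loss.item()
        n_batches += 1
        threshold = running_loss / n_batches  # crude "under-trained" signal
        steps = 0
        # Inconsistent part: keep updating on this batch while its loss stays
        # above the running mean, up to a small cap.
        while loss.item() > threshold and steps < extra_steps:
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            steps += 1
```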

Cited by 82 publications (42 citation statements)
References 21 publications (28 reference statements)
“…In contrast to BSP, ASP has the least training time for a given number of epochs, but usually yields much lower accuracy because it omits the synchronization step among workers. Moreover, little or no synchronization among workers tends to destabilize the convergence of a DNN model [22]. Hence, ASP is not stable in terms of model convergence.…”
Section: A. Distributed Paradigms For Updating the Parameters
confidence: 99%
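To make the BSP/ASP distinction in this excerpt concrete, the toy snippet below simulates both update rules in a single process. It is a schematic sketch under simplifying assumptions, not a real distributed implementation: the worker "gradients" are random stand-ins, and the staleness that makes ASP unstable is only hinted at in the comments.

```python
# Toy, single-process comparison of BSP vs ASP parameter updates. Worker
# gradients are random placeholders; a real system would compute them on
# separate machines and, for ASP, on stale copies of the parameters.
import torch

def bsp_step(params, worker_grads, lr=0.1):
    # Bulk-synchronous: wait for every worker, average the gradients,
    # then apply a single update.
    avg = torch.stack(worker_grads).mean(dim=0)
    return params - lr * avg

def asp_step(params, worker_grads, lr=0.1):
    # Asynchronous: apply each worker's gradient as soon as it arrives, so
    # later workers effectively act on parameters that others already moved.
    for g in worker_grads:
        params = params - lr * g
    return params

params = torch.zeros(4)
grads = [torch.randn(4) for _ in range(3)]   # pretend three workers
print(bsp_step(params, grads))
print(asp_step(params, grads))
```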
“…Algorithms like Stochastic Variance Reduced Gradient (SVRG) method [5] and related approaches [6] mix SGD-like steps with some batch computations to control the stochastic noise. Others have proposed to parallelize stochastic training through large mini-batches [7].…”
Section: Analysis Of Related Research
confidence: 99%
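As a concrete reference for the variance-reduction idea mentioned in this excerpt, below is a minimal SVRG sketch on a least-squares toy problem: an occasional full-batch gradient at a snapshot is combined with per-sample gradients as a control variate. The problem size, step size, and loop lengths are illustrative assumptions.

```python
# Minimal SVRG sketch: mix SGD-like steps with occasional full-batch gradient
# computations to reduce stochastic noise. Least-squares objective for brevity.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)

def grad_i(w, i):
    # Gradient of one sample's squared-error loss.
    return (A[i] @ w - b[i]) * A[i]

def svrg(w, lr=0.01, outer=20, inner=200):
    for _ in range(outer):
        snapshot = w.copy()
        full_grad = A.T @ (A @ snapshot - b) / len(b)   # occasional batch pass
        for _ in range(inner):
            i = rng.integers(len(b))
            # SGD-like step with control variate: the two stochastic terms use
            # the same sample index i, so their noise largely cancels.
            w = w - lr * (grad_i(w, i) - grad_i(snapshot, i) + full_grad)
    return w

w_hat = svrg(np.zeros(10))
```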
“…In the experiment, we used the Stochastic Gradient Descent (SGD) algorithm [45] as the model update method. The SGD optimizer has several parameter settings: the initial learning rate is 0.001, and the learning rate is then decreased by ten percent every five epochs.…”
Section: Network Structure
confidence: 99%
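The schedule described in this excerpt maps onto a standard step decay. The sketch below uses PyTorch's SGD with StepLR; reading "decreased by ten percent" literally gives a decay factor of 0.9 every five epochs, though the cited authors may have meant a factor of 0.1 ("to ten percent"). The model and the loop body are placeholders.

```python
# Sketch of the learning-rate schedule described above: SGD starting at 0.001,
# decayed every five epochs. gamma=0.9 assumes the literal "by ten percent"
# reading; use gamma=0.1 if "to ten percent" was intended.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                       # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

for epoch in range(30):
    # ... one training epoch would run here: forward, loss, backward, step ...
    optimizer.step()       # placeholder for the per-batch parameter updates
    scheduler.step()       # decay the learning rate at the epoch boundary
```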