2017
DOI: 10.1515/pralin-2017-0005

Empirical Investigation of Optimization Algorithms in Neural Machine Translation

Abstract: Training neural networks is a non-convex and high-dimensional optimization problem. In this paper, we provide a comparative study of the most popular stochastic optimization techniques used to train neural networks. We evaluate the methods in terms of convergence speed, translation quality, and training stability. In addition, we investigate combinations that seek to improve optimization in terms of these aspects. We train state-of-the-art attention-based models and apply them to perform neural machine translation…
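The comparison described in the abstract can be illustrated with a short sketch. The snippet below is not from the paper: the toy model, data, and hyperparameter values are placeholders, and it merely shows how identically initialized copies of a model can be trained with the optimizers the study covers (SGD, Adagrad, Adadelta, Adam) so that their convergence behavior can be compared side by side.

```python
# Hedged sketch: compare optimizers on identically initialized copies of a toy model.
# Placeholder model/objective/hyperparameters, not the paper's NMT setup.
import copy
import torch
from torch import nn

base_model = nn.Linear(512, 512)  # stand-in for an attention-based NMT model


def make_optimizer(name, params):
    # Illustrative hyperparameter values, not the settings used in the paper.
    if name == "SGD":
        return torch.optim.SGD(params, lr=0.1)
    if name == "Adagrad":
        return torch.optim.Adagrad(params, lr=0.05)
    if name == "Adadelta":
        return torch.optim.Adadelta(params, rho=0.95)
    if name == "Adam":
        return torch.optim.Adam(params, lr=1e-3)
    raise ValueError(name)


def run(name, steps=200):
    model = copy.deepcopy(base_model)           # same initialization for every optimizer
    opt = make_optimizer(name, model.parameters())
    for _ in range(steps):
        x = torch.randn(64, 512)
        loss = ((model(x) - x) ** 2).mean()     # placeholder objective, not an NMT loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()                          # proxy for convergence speed


for name in ["SGD", "Adagrad", "Adadelta", "Adam"]:
    print(f"{name:8s} final loss: {run(name):.4f}")
```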

Cited by 18 publications (17 citation statements)
References 6 publications
“…We have three variants of our model, using: (i) only the source memory (S-NMT+src mem), (ii) only the target memory (S-NMT+trg mem), or [footnote 5:] In our initial experiments, we found SGD to be more effective than Adam/Adagrad; an observation also made by Bahar et al. (2017). [footnote 6:] For the document NMT model training, we did some preliminary experiments using different learning rates and used the scheme which converged to the best perplexity in the least number of epochs, while for sentence-level training we follow Cohn et al. (2016).…”
Section: Results (mentioning)
confidence: 91%
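The learning-rate selection described in footnote 6 above can be sketched as follows. The helper is hypothetical, not code from the cited work: `make_trainer`, `train_one_epoch`, and `validation_perplexity` are assumed stand-ins for the authors' training loop, and each candidate rate is assumed to start from a freshly initialized model.

```python
# Hedged sketch: pick the learning rate that reaches the lowest validation
# perplexity, breaking ties by the number of epochs needed to get there.
import math


def select_learning_rate(candidate_lrs, make_trainer, max_epochs=10):
    """`make_trainer(lr)` is a hypothetical factory returning
    (train_one_epoch, validation_perplexity) for a freshly initialized model."""
    best = None  # (perplexity, epochs_to_reach_it, lr)
    for lr in candidate_lrs:
        train_one_epoch, validation_perplexity = make_trainer(lr)
        best_ppl, best_epoch = math.inf, max_epochs
        for epoch in range(1, max_epochs + 1):
            train_one_epoch()
            ppl = validation_perplexity()
            if ppl < best_ppl:
                best_ppl, best_epoch = ppl, epoch
        if best is None or (best_ppl, best_epoch) < best[:2]:
            best = (best_ppl, best_epoch, lr)
    return best[2]
```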
“…Note that the baseline in this work is much stronger than in our prior work (>5 BLEU). This is due to multiple factors that have been recommended as best practices for neural MT and have been incorporated in the present baseline: deduplication of the training data, ensemble decoding using multiple random runs, use of Adam as the optimizer instead of AdaDelta (Bahar et al., 2017; Denkowski and Neubig, 2017), and checkpoint averaging (Bahar et al., 2017), as well as a more recent neural modeling toolkit.…”
Section: Neural MT System (mentioning)
confidence: 99%
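Checkpoint averaging, as cited above from Bahar et al. (2017), takes the element-wise mean of the parameters from the last few saved checkpoints before decoding. A minimal sketch, assuming plain PyTorch state dicts and hypothetical file names:

```python
# Hedged sketch of checkpoint averaging: parameter-wise mean over saved checkpoints.
# Assumes each file holds a plain state dict; adapt to the actual toolkit's format.
import torch


def average_checkpoints(paths):
    """Load several PyTorch state dicts and return their parameter-wise average."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}


# Usage (hypothetical file names):
# averaged = average_checkpoints(["ckpt_8.pt", "ckpt_9.pt", "ckpt_10.pt"])
# model.load_state_dict(averaged)
```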
“…So et al (2019) apply NAS to Transformer on NMT tasks. There is also work on empirically exploring hyperparameters and architectures of NMT systems (Bahar et al, 2017;Britz et al, 2017;Lim et al, 2018), though the focus is on finding general best-practice configurations. This differs from the goal of HPO, which aims to find the best configuration specific to a given dataset.…”
Section: Related Workmentioning
confidence: 99%