2017
DOI: 10.48550/arxiv.1707.05589
Preprint

On the State of the Art of Evaluation in Neural Language Models

Abstract: Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing codebases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.
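
The evaluation method named in the abstract is large-scale automatic black-box hyperparameter tuning. As a minimal sketch only (the paper itself used Google's batched GP-bandit tuner, not plain random search), the following illustrates the black-box pattern: sample configurations, score each with a validation metric, keep the best. The search space and the synthetic `train_and_evaluate` surface are assumptions for demonstration.

```python
import math
import random

def train_and_evaluate(config):
    """Toy stand-in for a full training run: a synthetic 'perplexity'
    surface with an optimum near lr=1e-2, dropout=0.5 (purely illustrative)."""
    lr_term = (math.log10(config["learning_rate"]) + 2.0) ** 2
    drop_term = (config["dropout"] - 0.5) ** 2
    return 60.0 + 40.0 * lr_term + 80.0 * drop_term

# Hypothetical search space; the paper tunes LSTM hyperparameters like these.
SEARCH_SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),
    "dropout": lambda: random.uniform(0.0, 0.8),
}

def random_search(budget=100):
    """Black-box tuning: the tuner sees only config -> score, no gradients."""
    best_config, best_ppl = None, float("inf")
    for _ in range(budget):
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        ppl = train_and_evaluate(config)
        if ppl < best_ppl:
            best_config, best_ppl = config, ppl
    return best_config, best_ppl

if __name__ == "__main__":
    config, ppl = random_search()
    print(f"best config: {config}, synthetic perplexity: {ppl:.2f}")
```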

Cited by 86 publications (108 citation statements)
References 10 publications
“…While this finding might seem trivial at first, it has far-reaching consequences, as it raises questions about the reproducibility of inferences about the mapping between brain activity and cognitive states that are drawn from interpreting the cognitive decoding decisions of DL models. Recent empirical work in DL research has demonstrated that the convergence of DL models, and thereby the specifics of their learned mapping between input data and target signal, depends on many non-deterministic aspects of the training process, such as random seeds and random weight initializations (Dodge et al, 2019; Henderson et al, 2018; Lucic et al, 2018; Reimers and Gurevych, 2017), as well as on the specific choices for other hyper-parameters, such as individual layer specifications and optimization algorithms (Lucic et al, 2018; Melis et al, 2017; Zoph and Le, 2017). It is thus possible that the mapping between cognitive states and brain activity that a DL model learns can change with these factors of training.…”
Section: Discussion
Citation type: mentioning (confidence: 99%)
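
The seed-dependence this quote describes can be shown with a minimal sketch (not taken from any of the cited papers): train the same small non-convex model on the same data under different random seeds and observe the spread in held-out accuracy. The data generator, architecture, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def make_data(rng, n=400, d=10):
    """Synthetic nonlinear two-class problem (illustrative only)."""
    X = rng.normal(size=(n, d))
    y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(float)
    return X[:300], y[:300], X[300:], y[300:]

def train_mlp(X, y, seed, hidden=16, epochs=30, lr=0.5):
    """One-hidden-layer MLP, full-batch gradient descent. Only the weight
    initialization depends on the seed; because the loss surface is
    non-convex, different seeds can converge to different solutions."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    w2 = rng.normal(scale=0.5, size=hidden)
    for _ in range(epochs):
        H = np.tanh(X @ W1)                     # hidden activations
        p = 1.0 / (1.0 + np.exp(-(H @ w2)))     # predicted probabilities
        err = (p - y) / len(y)                  # dLoss/dlogits (cross-entropy)
        dH = np.outer(err, w2) * (1.0 - H ** 2) # backprop through tanh
        w2 -= lr * H.T @ err
        W1 -= lr * X.T @ dH
    return W1, w2

def accuracy(W1, w2, X, y):
    return ((np.tanh(X @ W1) @ w2 > 0).astype(float) == y).mean()

# Dataset is fixed; only the training seed varies across runs.
X_tr, y_tr, X_te, y_te = make_data(np.random.default_rng(0))
accs = [accuracy(*train_mlp(X_tr, y_tr, seed), X_te, y_te) for seed in range(10)]
print(f"test accuracy over 10 seeds: mean={np.mean(accs):.3f}, std={np.std(accs):.3f}")
```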
“…LSTMs [15] are a popular form of recurrent neural networks and serve as a well-known baseline for deep neural network models. Variants using LSTM remain competitive in various NLP tasks [22,25,29]. BiLSTM (Bi-directional LSTM) improves on the original LSTM by reading inputs in both forward and backward directions.…”
Section: Bi-directional Long Short-Term Memory
Citation type: mentioning (confidence: 99%)
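
As a generic PyTorch sketch of the bidirectional reading described above (not code from the citing paper; the sizes are arbitrary), `nn.LSTM` with `bidirectional=True` runs one pass left-to-right and one right-to-left, concatenating the per-step hidden states:

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes.
batch, seq_len, input_size, hidden_size = 4, 12, 32, 64

# bidirectional=True adds a second LSTM that reads the sequence in reverse;
# per-step outputs of both directions are concatenated on the feature axis.
bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)
output, (h_n, c_n) = bilstm(x)

print(output.shape)  # torch.Size([4, 12, 128]) -- 2 * hidden_size per step
print(h_n.shape)     # torch.Size([2, 4, 64])   -- one final state per direction
```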
“…For example, in [14], the authors question claimed advances in reinforcement learning research due to the lack of significance metrics and the variability of results. In [24], the authors argue that many years of claimed superiority in empirical performance in the field of language modeling are in fact unfounded, and show that the well-known stacked LSTM architecture, with appropriate hyperparameter tuning, outperforms other more recent and more sophisticated architectures. In [26], the authors highlight a flaw in many previous research works in the context of Bayesian deep learning: a well-established baseline (Monte Carlo dropout), when run to completion (i.e., when learning is not cut off preemptively after a set number of iterations), achieves similar or superior results compared to the very models that showcased superior results when they were introduced.…”
Section: 'Pest' Antipattern
Citation type: mentioning (confidence: 99%)
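
For context on the Monte Carlo dropout baseline mentioned in [26], here is a generic PyTorch sketch (assumed sizes, not the cited paper's code): dropout is kept stochastic at test time and predictions are averaged over several forward passes, with the spread across passes serving as an uncertainty signal.

```python
import torch
import torch.nn as nn

# A small classifier with dropout (sizes are arbitrary for illustration).
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10)
)

def mc_dropout_predict(model, x, n_samples=50):
    """Monte Carlo dropout: keep dropout active at test time and average
    the softmax outputs of several stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers so they stay stochastic at test time.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread

x = torch.randn(8, 32)                 # a dummy batch
mean_probs, spread = mc_dropout_predict(model, x)
print(mean_probs.shape, spread.shape)  # torch.Size([8, 10]) twice
```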