2018
DOI: 10.1162/tacl_a_00018

Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results

Abstract: “Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field” (Pfeiffer and Hoffmann, 2009). As we know, deep learning is very popular and the ability to reproduce results is an important part of science. There is growing concern within the deep learning community about the reproducibility of results that are presented. In this paper we present a number of controllable, yet unreported, effects tha…

Cited by 61 publications (42 citation statements). References 7 publications (5 reference statements).
“…One should further keep in mind an important caveat in interpreting the results in Table 3: As Reimers and Gurevych (2017) have discussed, non-determinism associated with training neural networks can yield significant differences in accuracy. Crane (2018) further demonstrated that for answer selection in question answering, a range of mundane issues such as software versions can have a significant impact on accuracy, and these effects can be larger than incremental improvements reported in the literature. We adopt the emerging best practice of reporting results from multiple trials, but this makes comparison to previous single-point results difficult.…”
Section: Results (mentioning)
confidence: 90%
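The multi-trial reporting practice described in this statement can be made concrete with a short sketch. This is a hypothetical illustration, not code from the cited papers: `run_trials` and the `train_and_evaluate` callback are assumed placeholder names, and the seeding shown covers only the sources of randomness the caller controls.

```python
import random
import statistics

import numpy as np
import torch


def run_trials(train_and_evaluate, seeds):
    """Run the same experiment under several random seeds and report
    mean and standard deviation instead of a single-point result."""
    scores = []
    for seed in seeds:
        # Fix every source of randomness we control; hardware- and
        # library-level nondeterminism may still remain.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        scores.append(train_and_evaluate(seed))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical usage with some experiment function `my_experiment`:
# mean_map, std_map = run_trials(my_experiment, seeds=range(5))
# print(f"MAP = {mean_map:.3f} ± {std_map:.3f} over 5 trials")
```

Reporting the spread across seeds is exactly what makes comparison to earlier single-point numbers awkward, as the quoted statement notes.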
“…In general, combining cross-pair and intra-pair similarities (with kernel sum or meta-classifiers) provides state-of-the-art results without using deep learning. Additionally, the outcome is deterministic, while the DNN accuracy may vary depending on the type of hardware used or the random initialization parameters (Crane, 2018). Tables 5, 6 and 7 report the performance of the most recent state-of-the-art systems on WikiQA, TREC13 and SemEval in comparison with our best results.…”
Section: Results (mentioning)
confidence: 99%
“…On the WikiQA dataset, our method does not seem to be robust to structural hyperparameter changes. Crane (2018) mentions that on the WikiQA dataset a neural matching model (Severyn and Moschitti, 2015) trained with different random seeds can result in differences of up to 0.08 in MAP and MRR. We leave further investigation of the high variance on the WikiQA dataset for future work.…”
Section: Discussion (mentioning)
confidence: 99%
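For context on the scale of that 0.08 swing, MAP and MRR are rank-based metrics averaged over the per-question candidate rankings. Below is a minimal sketch of how they are computed; the helper names and toy relevance labels are hypothetical, not taken from the cited work.

```python
def reciprocal_rank(relevance):
    """relevance: 0/1 labels in ranked order; return 1/rank of the
    first relevant answer, or 0 if no answer is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant answer."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# MRR and MAP average these per-question values over the test set, so a
# reshuffled candidate ranking under a different seed can shift both metrics.
queries = [[0, 1, 0], [1, 0, 0]]  # toy ranked relevance labels per question
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
map_score = sum(average_precision(q) for q in queries) / len(queries)
print(f"MRR = {mrr:.3f}, MAP = {map_score:.3f}")  # MRR = 0.750, MAP = 0.750
```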