2014
DOI: 10.1073/pnas.1323051111
Reproducibility issues in science, is P value really the only answer?

Cited by 17 publications (14 citation statements)
References 5 publications
“…Cumulative evidence often builds up from several studies with larger p-values that only when combined show clear evidence against the null hypothesis (Greenland et al. 2016, p. 343). Very possibly, more stringent thresholds would lead to even more results being left unpublished, enhancing publication bias (Gaudart et al. 2014; Gelman & Robert 2014). What we call winner's curse, truth inflation, or inflated effect sizes will become even more severe with more stringent thresholds (Button et al. 2013b).…”
Section: 'We Need More Stringent Decision Rules' (mentioning)
confidence: 99%
“…For many years, however, and more so recently, the use of the term "significant" for findings that cross this somewhat arbitrary p-threshold has been pointed out as detrimental to research and reliable findings [47,59]. Some authors have suggested a more stringent alpha level, for example .005 instead of the traditional .05 (e.g., References [6,44]), yet this suggestion leads others to be concerned about inflated Type II errors [33]. At any rate, using the word "significant" can be misleading, and some scholars recommend refraining from using this term altogether [41].…”
Section: P-values, Significance, and Effect Sizes (mentioning)
confidence: 99%
“…Conversely, a researcher might obtain a result with low significance (a high p-value) but a large effect size measure (or an important/practical result). There is a tendency to overvalue significance and to ignore effect size (see an early comment in archaeology by Thomas, 1978:233; see a recent discussion relevant to science by Gaudart et al., 2014), which relates to a singular focus on the role of the p-value in hypothesis testing (a multitude of examples could be cited here, but to do so would target case studies and authors; see Wolverton (2005) and Wolverton et al. (2008) for self-critical examples). We return to effect size and its interpretation in the 'Analyse effect size to determine practical significance' section later.…”
Section: Statistical Hypothesis Testing (mentioning)
confidence: 99%