Over‐optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results

Nießl, Christina; Herrmann, Moritz; Wiedemann, Christine; Casalicchio, Giuseppe; Boulesteix, Anne‐Laure

doi:10.1002/widm.1441

Cited by 19 publications

(32 citation statements)

References 63 publications

(134 reference statements)


“…There are many possible options of dealing with missing data, the most fundamental decision being whether missing data should be deleted or replaced by plausible values - a technique also known as imputation. If a researcher decides to impute missing data, they can choose from a plethora of imputation methods, all leading to slightly different replacement values that can influence the results of statistical hypothesis tests (Nießl et al, 2021).…”

Section: (10) Favorable Imputation (mentioning)

confidence: 99%

Big Little Lies: A Compendium and Simulation of p-Hacking Strategies

Stefan, Schönbrodt

2022

Preprint


In many research fields, the widespread use of questionable research practices has jeopardized the credibility of scientific results. One of the most prominent questionable research practices is p-hacking. Typically, p-hacking is defined as a compound of strategies targeted at rendering non-significant hypothesis testing results significant. However, a comprehensive overview of these p-hacking strategies is missing, and current meta-scientific research often ignores the heterogeneity of strategies. Here, we compile a list of twelve p-hacking strategies based on an extensive literature review, identify factors that control their level of severity, and demonstrate their impact on false-positive rates using simulation studies. We also use our simulation results to evaluate several approaches that have been proposed to mitigate the influence of questionable research practices. Our results show that investigating p-hacking at the level of strategies can provide a better understanding of the process of p-hacking, as well as a broader basis for developing effective countermeasures. By making our analyses available through a Shiny app and R package, we facilitate future meta-scientific research aimed at investigating the ramifications of p-hacking across multiple strategies, and we hope to start a broader discussion about different manifestations of p-hacking in practice.
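The effect of a p-hacking strategy on the false-positive rate can be illustrated with a small simulation. The sketch below is not the authors' code or their R package; it assumes one commonly discussed strategy (testing several outcome variables and reporting only the smallest p-value), a z-test with known unit variance, and independent outcomes. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b):
    """Two-sided z-test p-value for a mean difference (assumes unit variance)."""
    n_a, n_b = len(group_a), len(group_b)
    z = (fmean(group_a) - fmean(group_b)) / ((1 / n_a + 1 / n_b) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sim=2000, n_per_group=30, n_outcomes=5,
                         alpha=0.05, seed=1):
    """Simulate under the null: no true group difference on any outcome."""
    rng = random.Random(seed)
    honest_hits = 0  # only the first (pre-specified) outcome is tested
    hacked_hits = 0  # the smallest p-value across all outcomes is reported
    for _ in range(n_sim):
        p_values = []
        for _ in range(n_outcomes):
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        honest_hits += p_values[0] < alpha
        hacked_hits += min(p_values) < alpha
    return honest_hits / n_sim, hacked_hits / n_sim


honest_rate, hacked_rate = false_positive_rates()
print(f"honest: {honest_rate:.3f}, selective reporting: {hacked_rate:.3f}")
```

With independent outcomes, the nominal rate under selective reporting is 1 - (1 - alpha)^k, roughly 0.23 for alpha = 0.05 and k = 5; correlated outcomes would inflate the rate less.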


“…Here, several approaches exist. Most commonly, the methods are ranked according to their performance and results are presented as summaries of this ranking, see Nießl et al (2021) for a detailed discussion. As pointed out by Boulesteix et al (2013), the concepts of meta-analysis could also be extended for the framework of method comparison studies.…”

Section: Discussion (mentioning)

confidence: 99%

“…Recently, it has been noted in the context of data analysis that there is a tendency to over-optimistic reporting of the performance of new methods and a lack of neutral comparison studies in the literature, see e.g. Boulesteix (2015); Boulesteix et al (2017, 2013); Van Mechelen et al (2018); Weber et al (2019); Buchka et al (2021); Nießl et al (2021); Pawel et al (2022). Neutral comparison studies, however, are essential to guarantee a fair comparison of existing methods across different scenarios, thus allowing an applied researcher to determine the best method for her or his situation.…”

Section: Introduction (mentioning)

confidence: 99%


On the role of benchmarking data sets and simulations in method comparison studies

Friedrich, Friede

2022

Preprint


Method comparisons are essential to provide recommendations and guidance for applied researchers, who often have to choose from a plethora of available approaches. While many comparisons exist in the literature, these are often not neutral but favour a novel method. Apart from the choice of design and a proper reporting of the findings, there are different approaches concerning the underlying data for such method comparison studies. Most manuscripts on statistical methodology rely on simulation studies and provide a single real-world data set as an example to motivate and illustrate the methodology investigated. In the context of supervised learning, in contrast, methods are often evaluated using so-called benchmarking data sets, i.e. real-world data that serve as gold standard in the community. Simulation studies, on the other hand, are much less common in this context. The aim of this paper is to investigate differences and similarities between these approaches, to discuss their advantages and disadvantages and ultimately to develop new approaches to the evaluation of methods picking the best of both worlds. To this aim, we borrow ideas from different contexts such as mixed methods research and Clinical Scenario Evaluation.


“…The goal of this study is to provide an overview of existing approaches for encoding categorical predictor variables and to study their effect on a model's predictive performance. Following calls in the computational statistics community for neutral benchmark studies (Boulesteix et al 2017), which do not introduce a new method, thus reducing the risk of cherry picking methods (Dehghani et al 2021) and reporting over-optimistic performance (Nießl et al 2021), we present a carefully designed experimental setting to discern the effect of encoding strategies and their interaction with different ML algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Pargent, Pfisterer, Thomas, et al.

2022

Comput Stat


Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem are high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance, and—if possible—derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary- and multiclass–classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
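To make the idea of target encoding concrete, here is a minimal sketch in which each feature level is replaced by a target mean estimated on the training set. The regularization shown is simple additive smoothing toward the global mean, chosen for illustration; it is an assumption and not necessarily the regularized variant evaluated in the study. The function names and the `smoothing` parameter are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(levels, targets, smoothing=10.0):
    """Learn a level -> shrunken target mean mapping from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    # Rare levels are pulled toward the global mean; frequent levels keep
    # an estimate close to their own observed mean.
    mapping = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return mapping, global_mean


def apply_target_encoding(levels, mapping, global_mean):
    """Encode levels; levels unseen during training fall back to the global mean."""
    return [mapping.get(level, global_mean) for level in levels]


train_levels = ["a", "a", "a", "b", "c", "c"]
train_targets = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
mapping, global_mean = fit_target_encoding(train_levels, train_targets, smoothing=2.0)
encoded = apply_target_encoding(["a", "b", "z"], mapping, global_mean)
print(encoded)
```

Fitting the mapping on the training set and only applying it to new data avoids leaking test-set targets into the encoded feature.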
