On a Comprehensive Metadata Framework for Artificial Data in Unsupervised Learning

Dangl, Rainer; Leisch, Friedrich

doi:10.5445/ksp/1000058749/22

Cited by 1 publication

(2 citation statements)

References 0 publications


“…A blueprint of a novel device for simulating data for benchmarking in unsupervised learning has been designed by Dangl and Leisch (2017). This blueprint comprises the plan of a web repository and an accompanying R (R Core Team, 2017) package for the actual production of metadata objects and for the subsequent generation of data sets on a local computer.…”

Section: Issues (mentioning)

confidence: 99%

“…• In case of simulated data, organize a fair comparison in terms of the relation between the methods under study and the data-generating mechanisms of the simulations, with fair meaning that one should not exclusively rely on mechanisms that unilaterally favor methods which explicitly or implicitly assume that these mechanisms are in place.
• Disclose full information on the data sets that are used (making use, whenever meaningful, of platforms such as GitHub or Gitlab) in view of reproducibility (Dangl & Leisch, 2017; Donoho, 2010; Hofner et al., 2016; Peng, 2011) and of enabling follow-up research. This means that: for simulated data sets, provide implementable data-generating code with full information on cluster-specific parameters, the data-generating function, random seeds, the type and version of the random number generator, and so on; for empirical data sets, provide the full data sets, with sufficient detail on format, codes used to denote missing values, pre-processing, and so on.…”

Section: Recommendations (mentioning)

confidence: 99%
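The disclosure recommendation quoted above can be illustrated with a minimal sketch. It is not the metadata framework or R package described by Dangl and Leisch. It is a hypothetical Python example that assumes an isotropic Gaussian mixture as the data-generating function, and the names `generate_clusters` and `metadata` are invented for illustration. The point is that cluster-specific parameters, the seed, and the type and version of the random number generator travel together in one serializable record.

```python
import json
import platform
import random


def generate_clusters(metadata):
    """Generate points and cluster labels from a metadata record (hypothetical format)."""
    # random.Random is Python's Mersenne Twister generator; seeding it from the
    # record makes the data set reproducible from the metadata alone.
    rng = random.Random(metadata["seed"])
    points, labels = [], []
    for label, cluster in enumerate(metadata["clusters"]):
        for _ in range(cluster["n"]):
            # One independent Gaussian draw per coordinate (isotropic cluster).
            points.append([rng.gauss(m, cluster["sd"]) for m in cluster["mean"]])
            labels.append(label)
    return points, labels


# Assumed metadata layout: everything needed to regenerate the data set.
metadata = {
    "generator": "isotropic Gaussian mixture",
    "seed": 20170101,
    "rng": "Mersenne Twister (Python random.Random)",
    "python_version": platform.python_version(),
    "clusters": [
        {"mean": [0.0, 0.0], "sd": 1.0, "n": 50},
        {"mean": [5.0, 5.0], "sd": 0.5, "n": 30},
    ],
}

points, labels = generate_clusters(metadata)
points_again, labels_again = generate_clusters(metadata)

# The record can be published as JSON next to the benchmark results.
restored = json.loads(json.dumps(metadata))
```

Calling `generate_clusters` twice with the same record yields identical data on the same interpreter version, which is why the version is stored alongside the seed: a different generator implementation could produce different draws from the same seed.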


A white paper on good research practices in benchmarking: The case of cluster analysis

Mechelen, Boulesteix, Dangl, et al. 2023

WIREs Data Min & Knowl

Self Cite


To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance, requiring that proposals of new methods are extensively and carefully compared with their best predecessors, and existing methods subjected to neutral comparison studies. Answers to benchmarking questions should be evidence‐based, with the relevant evidence being collected through well‐thought‐out procedures, in reproducible and replicable ways. In the present paper, we review good research practices in benchmarking from the perspective of the area of cluster analysis. Discussion is given to the theoretical, conceptual underpinnings of benchmarking based on simulated and empirical data in this context. Subsequently, the practicalities of how to address benchmarking questions in clustering are dealt with, and foundational recommendations are made based on existing literature.

This article is categorized under: Fundamental Concepts of Data and Knowledge > Data Concepts; Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Structure Discovery and Clustering


Section: Issues (mentioning)

confidence: 99%