2014
DOI: 10.1109/tmm.2014.2330697

A Simple Method to Determine if a Music Information Retrieval System is a “Horse”

Abstract: We propose and demonstrate a simple method to explain the figure of merit (FoM) of a music information retrieval (MIR) system evaluated in a dataset, specifically, whether the FoM comes from the system using characteristics confounded with the "ground truth" of the dataset. Akin to the controlled experiments designed to test the supposed mathematical ability of the famous horse "Clever Hans," we perform two experiments to show how three state-of-the-art MIR systems produce excellent FoM in spite of not using m…

Cited by 99 publications (91 citation statements)
References 31 publications
“…This differs from applications such as audio archive analysis, for which a system must be robust to signal modifications induced by variation of microphones and preprocessing across the dataset [36]. For embodied machine listening, aspects such as the microphone frequency response will be constant factors rather than random factors.…”
Section: A Requirements Gathering (mentioning)
confidence: 99%
“…This motivates the second contribution of our work: even with very good performance in these "proxy" evaluations, caution must be taken when discussing what these systems have actually learned to do. Even though a model may appear to be doing the right things, it may be working with concepts that are not very general (Sturm, 2014; Sturm and Ben-Tal, 2017). For instance, the folk-rnn models seem to be able to count time and repeat and vary material in ways that are stylistically plausible, but these abilities disappears when the models are pushed even a little outside of its training material.…”
Section: Informing the Research Pursuit Of Machine Learning (mentioning)
confidence: 99%
“…temporal or spectral features) to high-level semantic labels using manually pre-labeled training samples [8]-[16]. The task, however, remains challenging due to the following three issues: the scarcity of well-labeled training data [17], [18], the complexity involved in formalizing and evaluating the task while taking care of possible confounds [18], [19], and the difficulty of extracting good audio features that capture the characteristics of each tag [20]-[24]. Good feature design is hard to come by, for example for tags that are social and cultural constructs (e.g.…”
Section: Introduction (mentioning)
confidence: 99%