2021
DOI: 10.48550/arxiv.2112.04214
Preprint
Learning music audio representations via weak language supervision

Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address thi…

Cited by 1 publication (1 citation statement)
References 9 publications (12 reference statements)
“…Several recent works have emerged in this domain, proposing methods to automatically generate music descriptions [21,7,9], synthesise music from a text prompt [2,14,30,6], search for music based on language queries [8,22,13], and more [20,17]. However, evaluating M&L models remains a challenge due to a lack of public and accessible datasets with paired audio and language, resulting in the widespread use of private data [21,22,23,14,2,13] and inconsistent evaluation practices. To mitigate this, we release the Song Describer dataset (SDD), a new high-quality evaluation dataset of crowdsourced captions paired with openly licensed music recordings.…”
Section: Introduction
confidence: 99%