2021
DOI: 10.48550/arxiv.2112.04214
Preprint
Learning music audio representations via weak language supervision

Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address thi…

Cited by 1 publication (1 citation statement)
References 9 publications (12 reference statements)
“…Several recent works have emerged in this domain, proposing methods to automatically generate music descriptions [21,7,9], synthesise music from a text prompt [2,14,30,6], search for music based on language queries [8,22,13], and more [20,17]. However, evaluating M&L models remains a challenge due to a lack of public and accessible datasets with paired audio and language, resulting in the widespread use of private data [21,22,23,14,2,13] and inconsistent evaluation practices. To mitigate this, we release the Song Describer dataset (SDD), a new high-quality evaluation dataset of crowdsourced captions paired with openly licensed music recordings.…”
Section: Introduction
confidence: 99%