2021 International Joint Conference on Neural Networks (IJCNN) 2021
DOI: 10.1109/ijcnn52387.2021.9533461

MusCaps: Generating Captions for Music Audio

Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our e…


Cited by 9 publications (4 citation statements) | References 35 publications
“…Choi et al [160] propose generating descriptions for music playlists by combining audio content analysis and natural language processing to utilize the information of each track. MusCaps [161] is a music audio captioning model that generates descriptions of music audio content by processing audio-text inputs through a multimodal encoder and leveraging audio pre-training to obtain effective musical feature representations. For music-and-language pre-training, Manco et al [162] propose a multimodal architecture that uses weakly aligned text as the only supervisory signal to learn general-purpose music audio representations.…”
Section: Text Audio Generationmentioning
confidence: 99%
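The captioning pipeline summarised in the citation above (an audio encoder producing a representation that conditions a text decoder) can be sketched as a greedy decoding loop. This is a hypothetical toy illustration in plain Python, not the MusCaps implementation: `encode_audio`, `decoder_scores`, and the vocabulary are all stand-ins for learned components.

```python
# Toy sketch of audio-conditioned greedy caption decoding (hypothetical,
# not the MusCaps code). An "encoder" maps audio features to a context,
# and a "decoder" scores the next token given that context and the prefix.

VOCAB = ["<s>", "a", "calm", "piano", "piece", "</s>"]

def encode_audio(features):
    # Stand-in for a pretrained audio encoder: collapse frame features
    # into a single context value (real models output embedding matrices).
    return sum(features) / len(features)

def decoder_scores(context, prefix):
    # Stand-in for a learned decoder: deterministic toy scores so the
    # example is runnable. Higher score = more likely next token.
    step = len(prefix)                      # position in the caption
    order = ["a", "calm", "piano", "piece", "</s>"]
    target = order[min(step - 1, len(order) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in VOCAB[1:]}

def greedy_caption(features, max_len=8):
    context = encode_audio(features)
    prefix = ["<s>"]
    while len(prefix) < max_len:
        scores = decoder_scores(context, prefix)
        nxt = max(scores, key=scores.get)   # greedy: highest-scoring token
        if nxt == "</s>":
            break
        prefix.append(nxt)
    return " ".join(prefix[1:])

print(greedy_caption([0.1, 0.3, 0.2]))      # → "a calm piano piece"
```

In practice the greedy step is usually replaced by beam search or sampling, and the decoder attends over the full sequence of audio-encoder outputs rather than a single pooled context.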
“…Year | Model | Task | Code
…    | (truncated)        | Text-Audio Generation | https://github.com/rishikksh20/AdaSpeech2
2020 | Lombard [154]      | Text-Audio Generation | https://github.com/dipjyoti92/TTS-Style-Transfer
2019 | Zhang et al [156]  | Text-Audio Generation | https://github.com/PaddlePaddle/PaddleSpeech
2019 | Yu et al [157]     | Text-Music Generation | -
2018 | JTAV [158]         | Text-Music Generation | https://github.com/mengshor/JTAV
2021 | Ferraro et al [159] | Text-Music Generation | https://github.com/andrebola/contrastive-mir-learning
2016 | Choi et al [160]   | Text-Music Generation | -
2021 | MusCaps [161]      | Text-Music Generation | https://github.com/ilaria-manco/muscaps
2022 | Manco et al [162]  | Text-Music Generation | https://github.com/ilaria-manco/mulap
2022 | CLAP [163]         | Text-Music Generation | https://github.com/YuanGongND/vocalsound
2020 | Jukebox [203]      | Text-Music Generation | https://github.com/openai/jukebox…”
Section: A Curated Advances In Generative Aimentioning
confidence: 99%
“…Core research in MIR, however, still focuses on tasks such as key and chord recognition [100], [101], tempo and beat tracking [102], the detection of musical note onsets [103], [104], automatic music transcription [105], classification [106], and description (also known as captioning) [107], [108] as well as music emotion recognition [109]- [111]. A large body of research considers musical audio in these tasks to support search, retrieval and interaction use cases.…”
Section: Software: -Acoustic Features Extraction Algorithms -Detectio...mentioning
confidence: 99%
“…Generating full sentence descriptions of a music piece may be considered an extension of the tagging problem. This involves the use of an acoustic model and a large language model [108].…”
Section: Software: -Acoustic Features Extraction Algorithms -Detectio...mentioning
confidence: 99%
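The view of captioning as an extension of tagging, as described above, can be sketched as a two-stage pipeline: an acoustic model predicts tag confidences, and a language component turns the retained tags into a sentence. This is a hypothetical minimal sketch; the template stands in for the large language model the citation mentions, and `predict_tags` is a hard-coded dummy.

```python
# Hypothetical sketch: captioning as an extension of tagging.
# Stage 1: an acoustic model assigns confidences to tags.
# Stage 2: a language component (here a template, standing in for an
# actual large language model) composes a sentence from the kept tags.

def predict_tags(audio_frames):
    # Dummy acoustic model: returns fixed tag confidences for illustration.
    return {"piano": 0.9, "calm": 0.8, "electronic": 0.1}

def tags_to_caption(tags, threshold=0.5):
    # Keep tags above the confidence threshold, most confident first.
    kept = [t for t, p in sorted(tags.items(), key=lambda kv: -kv[1])
            if p >= threshold]
    if not kept:
        return "An unidentified piece of music."
    if len(kept) == 1:
        return f"A piece featuring {kept[0]}."
    # Most confident tag becomes the subject; the rest become modifiers.
    return f"A {' '.join(kept[1:])} piece featuring {kept[0]}."

print(tags_to_caption(predict_tags(None)))  # → "A calm piece featuring piano."
```

A real system would replace the template with a generative language model conditioned on the tags (or directly on acoustic features), but the staging, acoustic evidence first, language realisation second, is the same.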