ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746996

Learning Music Audio Representations Via Weak Language Supervision

Cited by 21 publications (19 citation statements)
References 9 publications
“…MusCaps [161] is a music audio captioning model that generates descriptions of music audio content by processing audio-text inputs through a multimodal encoder and leveraging audio data pre-training to obtain effective musical feature representations. For music and language pre-training, Manco et al. [162] propose a multimodal architecture, which uses weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. CLAP [163] is another method for learning audio concepts from natural language supervision that utilizes two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space.…”
Section: Text Audio Generation (mentioning; confidence: 99%)
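The two-encoder contrastive setup that this citation attributes to CLAP can be sketched roughly as below. This is a minimal illustration, not the cited implementation: the encoder modules, embedding dimensions, and temperature initialisation are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAudioText(nn.Module):
    """Two-encoder contrastive model in the spirit of CLAP: audio and text
    embeddings are projected into a joint space and aligned with a symmetric
    InfoNCE-style loss. The encoders passed in are placeholders (e.g. a CNN
    over log-mel spectrograms and a transformer text encoder)."""

    def __init__(self, audio_encoder, text_encoder, audio_dim, text_dim, joint_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.text_encoder = text_encoder
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature (assumed init)

    def forward(self, audio, text_tokens):
        # Project both modalities into the joint space and L2-normalise.
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        logits = self.logit_scale.exp() * a @ t.T            # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
        # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
        return loss
```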
“…We adopt the T5 encoder (Raffel et al., 2020) and use the non-pooled token embedding sequence to condition the diffusion models. A thorough comparison with alternative contextual signals, such as embeddings from different large language models, or a single vector embedding derived from CLIP-like (Radford et al., 2021) text encoders trained on music-text pairs (Huang et al., 2022; Manco et al., 2022), is beyond the scope of this work.…”
Section: Text Understanding (mentioning; confidence: 99%)
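A minimal sketch of obtaining a non-pooled T5 token embedding sequence with the Hugging Face transformers API is shown below; the checkpoint name and prompt are illustrative, and the hand-off to a diffusion model's cross-attention layers is only indicated in comments, since that part is specific to the cited system.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Load a frozen T5 encoder; "t5-base" is an illustrative checkpoint choice.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

prompt = "a slow blues guitar solo with brushed drums"
tokens = tokenizer(prompt, return_tensors="pt", padding=True)

with torch.no_grad():
    # Keep the full (non-pooled) token embedding sequence rather than a single
    # pooled vector, so a downstream model can attend over it with cross-attention.
    out = encoder(input_ids=tokens.input_ids, attention_mask=tokens.attention_mask)
    cond = out.last_hidden_state          # shape: (1, seq_len, d_model)

# `cond` and `tokens.attention_mask` would then be passed to the diffusion
# model's cross-attention layers (not shown here).
```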
“…On the other hand, while the performance of automatic audio source separation algorithms has reached a satisfactory level in recent years [33][34][35][36], the process of extracting the various sources from large unlabelled music collections is time-consuming. As a middle ground, we utilized the Magna-Tag-A-Tune (MTAT) dataset [37], which has been used in the literature for self-supervised audio pretraining [10,13]. MTAT includes a total of 25863 song clips, with a duration of 30 seconds each, sampled at 16 kHz, each associated with a number of tags.…”
Section: Data and Preprocessing (mentioning; confidence: 99%)
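For context, turning a 30-second, 16 kHz MTAT clip into a log-mel spectrogram suitable for a CNN encoder might look like the following librosa sketch; the file path and mel parameters are assumptions rather than the cited papers' exact settings.

```python
import librosa
import numpy as np

# Illustrative preprocessing for one MTAT clip (30 s, resampled to 16 kHz).
y, sr = librosa.load("mtat_clip.mp3", sr=16000, mono=True, duration=30.0)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames) input features
```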
“…Concerning the training of the downstream classifiers, in the case of MTAT, the commonly used 12:1:3 split between training, validation and testing data [10,42] was employed, while for both NSynth and FMA we utilized the default splits between training, validation and testing data: in the case of NSynth, the training, validation and testing sets consist of 289205, 12678, and 4096 audio segments, respectively, while for FMA, the data are split in a stratified way into training, validation and testing data, using an 8:1:1 ratio. For MTAT and FMA, we directly used a linear classifier, while following the literature [13,43], for NSynth, we used an intermediate layer with 512 neurons and a ReLU activation function. The classifiers were trained using Adam with a learning rate equal to 0.0005 for MTAT and FMA and 0.0003 for NSynth, and a batch size of 128, while early stopping was applied with a patience of 5 epochs.…”
Section: Experimental Protocol (mentioning; confidence: 99%)
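The probing protocol described in this excerpt (a linear probe for MTAT/FMA, a 512-unit ReLU layer for NSynth, Adam with the stated learning rates, batch size 128, and early stopping with a patience of 5 epochs) could be approximated as in the sketch below; the dummy embeddings, class count, and epoch budget are assumptions standing in for the real frozen representations and datasets.

```python
import torch
import torch.nn as nn

def make_probe(embed_dim, n_classes, hidden=None):
    """Linear probe (MTAT/FMA) or an MLP probe with one 512-unit ReLU layer (NSynth)."""
    if hidden is None:
        return nn.Linear(embed_dim, n_classes)
    return nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

# Dummy frozen embeddings and labels stand in for the real datasets.
X_train, y_train = torch.randn(1024, 512), torch.randint(0, 50, (1024,))
X_val, y_val = torch.randn(256, 512), torch.randint(0, 50, (256,))

probe = make_probe(embed_dim=512, n_classes=50)              # linear probe, as for MTAT/FMA
optimizer = torch.optim.Adam(probe.parameters(), lr=5e-4)    # 3e-4 reported for NSynth
criterion = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X_train, y_train), batch_size=128, shuffle=True)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    for xb, yb in loader:
        optimizer.zero_grad()
        criterion(probe(xb), yb).backward()
        optimizer.step()
    with torch.no_grad():
        val_loss = criterion(probe(X_val), y_val).item()
    # Early stopping with a patience of 5 epochs, as in the cited protocol.
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```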