ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746996

Learning Music Audio Representations Via Weak Language Supervision

Cited by 21 publications (19 citation statements)
References 9 publications
“…MusCaps [161] is a music audio captioning model that generates descriptions of music audio content by processing audio-text inputs through a multimodal encoder and leveraging audio data pre-training to obtain effective musical feature representations. For music and language pre-training, Manco et al. [162] propose a multimodal architecture, which uses weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. CLAP [163] is another method for learning audio concepts from natural language supervision that utilizes two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space.…”
Section: Text Audio Generation (mentioning; confidence: 99%)
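The two-encoder contrastive setup that this citation attributes to CLAP can be sketched roughly as below. This is a minimal illustration, not the cited implementation: the encoder modules, embedding dimensions, and temperature initialisation are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAudioText(nn.Module):
    """Two-encoder contrastive model in the spirit of CLAP: audio and text
    embeddings are projected into a joint space and aligned with a symmetric
    InfoNCE-style loss. The encoders passed in are placeholders (e.g. a CNN
    over log-mel spectrograms and a transformer text encoder)."""

    def __init__(self, audio_encoder, text_encoder, audio_dim, text_dim, joint_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.text_encoder = text_encoder
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature (assumed init)

    def forward(self, audio, text_tokens):
        # Project both modalities into the joint space and L2-normalise.
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        logits = self.logit_scale.exp() * a @ t.T            # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
        # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
        return loss
```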
“…We adopt the T5 encoder (Raffel et al., 2020) and use the non-pooled token embedding sequence to condition the diffusion models. A thorough comparison with alternative contextual signals, such as embeddings from different large language models, or a single vector embedding derived from CLIP-like (Radford et al., 2021) text encoders trained on music-text pairs (Huang et al., 2022; Manco et al., 2022), is beyond the scope of this work.…”
Section: Text Understanding (mentioning; confidence: 99%)
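A minimal sketch of obtaining a non-pooled T5 token embedding sequence with the Hugging Face transformers API is shown below; the checkpoint name and prompt are illustrative, and the hand-off to a diffusion model's cross-attention layers is only indicated in comments, since that part is specific to the cited system.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Load a frozen T5 encoder; "t5-base" is an illustrative checkpoint choice.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

prompt = "a slow blues guitar solo with brushed drums"
tokens = tokenizer(prompt, return_tensors="pt", padding=True)

with torch.no_grad():
    # Keep the full (non-pooled) token embedding sequence rather than a single
    # pooled vector, so a downstream model can attend over it with cross-attention.
    out = encoder(input_ids=tokens.input_ids, attention_mask=tokens.attention_mask)
    cond = out.last_hidden_state          # shape: (1, seq_len, d_model)

# `cond` and `tokens.attention_mask` would then be passed to the diffusion
# model's cross-attention layers (not shown here).
```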
“…On the other hand, while the performance of automatic audio source separation algorithms has reached a satisfactory level in recent years [33][34][35][36], the process of extracting the various sources from large unlabelled music collections is time-consuming. As a middle ground, we utilized the Magna-Tag-A-Tune (MTAT) dataset [37], which has been used in the literature for self-supervised audio pretraining [10,13]. MTAT includes a total of 25863 song clips, with a duration of 30 seconds each, sampled at 16 kHz, each associated with a number of tags.…”
Section: Data and Preprocessing (mentioning; confidence: 99%)
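For context, turning a 30-second, 16 kHz MTAT clip into a log-mel spectrogram suitable for a CNN encoder might look like the following librosa sketch; the file path and mel parameters are assumptions rather than the cited papers' exact settings.

```python
import librosa
import numpy as np

# Illustrative preprocessing for one MTAT clip (30 s, resampled to 16 kHz).
y, sr = librosa.load("mtat_clip.mp3", sr=16000, mono=True, duration=30.0)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames) input features
```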
“…Concerning the training of the downstream classifiers, in the case of MTAT, the commonly used 12:1:3 split between training, validation and testing data [10,42] was employed, while for both NSynth and FMA we utilized the default splits between training, validation and testing data: in the case of NSynth, the training, validation and testing sets consist of 289205, 12678, and 4096 audio segments, respectively, while for FMA, the data are split in a stratified way into training, validation and testing data, using an 8:1:1 ratio. For MTAT and FMA, we directly used a linear classifier, while following the literature [13,43], for NSynth, we used an intermediate layer with 512 neurons and a ReLU activation function. The classifiers were trained using Adam with a learning rate equal to 0.0005 for MTAT and FMA and 0.0003 for NSynth, and a batch size of 128, while early stopping was applied with a patience of 5 epochs.…”
Section: Experimental Protocol (mentioning; confidence: 99%)
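The probing protocol described in this excerpt (a linear probe for MTAT/FMA, a 512-unit ReLU layer for NSynth, Adam with the stated learning rates, batch size 128, and early stopping with a patience of 5 epochs) could be approximated as in the sketch below; the dummy embeddings, class count, and epoch budget are assumptions standing in for the real frozen representations and datasets.

```python
import torch
import torch.nn as nn

def make_probe(embed_dim, n_classes, hidden=None):
    """Linear probe (MTAT/FMA) or an MLP probe with one 512-unit ReLU layer (NSynth)."""
    if hidden is None:
        return nn.Linear(embed_dim, n_classes)
    return nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

# Dummy frozen embeddings and labels stand in for the real datasets.
X_train, y_train = torch.randn(1024, 512), torch.randint(0, 50, (1024,))
X_val, y_val = torch.randn(256, 512), torch.randint(0, 50, (256,))

probe = make_probe(embed_dim=512, n_classes=50)              # linear probe, as for MTAT/FMA
optimizer = torch.optim.Adam(probe.parameters(), lr=5e-4)    # 3e-4 reported for NSynth
criterion = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X_train, y_train), batch_size=128, shuffle=True)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    for xb, yb in loader:
        optimizer.zero_grad()
        criterion(probe(xb), yb).backward()
        optimizer.step()
    with torch.no_grad():
        val_loss = criterion(probe(X_val), y_val).item()
    # Early stopping with a patience of 5 epochs, as in the cited protocol.
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```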