Research applying machine learning to music modeling and generation typically proposes model architectures, training methods and datasets, and gauges system performance using quantitative measures like sequence likelihoods and/or qualitative listening tests. Rarely does such work explicitly question and analyse its usefulness for and impact on real-world practitioners, and then build on those outcomes to inform the development and application of machine learning. This article attempts to do these things for machine learning applied to music creation. Together with practitioners, we develop and use several applications of machine learning for music creation, and present a public concert of the results. We reflect on the entire experience to arrive at several ways of advancing these and similar applications of machine learning to music creation.
Sound sample indexing usually deals with the recognition of the source/cause that produced the sound. For abstract sounds, sound effects, and unnatural or synthetic sounds, this cause is usually unknown or unrecognizable. An efficient description of such sounds has been proposed by Schaeffer under the name morphological description. Part of this description consists in characterizing a sound by matching the temporal evolution of its acoustic properties to a set of profiles. In this work, we consider three morphological descriptions: dynamic profiles (ascending, descending, ascending/descending, stable, impulsive), melodic profiles (up, down, stable, up/down, down/up) and complex-iterative sound description (non-iterative, iterative, grain, repetition). We study the automatic indexing of a sound into these profiles. Because this automatic indexing is difficult using standard audio features, we propose new audio features to perform the task. The dynamic profiles are estimated by modeling the loudness of a sound over time with a second-order B-spline and deriving features from this model. The melodic profiles are estimated by tracking over time the perceptual filter with the maximum excitation; a function derived from this track is then modeled with a second-order B-spline, from which features are again derived. The description of complex-iterative sounds is obtained by estimating the amount of repetition and the period of the repetition, both computed from an audio similarity function derived from an MFCC similarity matrix. The proposed audio features are then tested for automatic classification. We consider three classification tasks corresponding to the three profiles and, in each case, compare the results with those obtained using standard audio features.
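The dynamic-profile estimation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the `dynamic_profile` function, the interpolating fit, and the slope threshold are assumptions made for the sketch, and the "impulsive" class is omitted because it would additionally require absolute duration information.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def dynamic_profile(loudness):
    """Classify a loudness envelope into a dynamic profile by fitting a
    second-order (quadratic) B-spline and reading slopes off the fit.
    Thresholds and time normalization are illustrative assumptions."""
    y = np.asarray(loudness, dtype=float)
    t = np.linspace(0.0, 1.0, len(y))            # normalized time axis
    spline = UnivariateSpline(t, y, k=2, s=0.0)  # k=2 -> second-order B-spline
    # Average slopes over the first and second halves of the sound.
    slope1 = (spline(0.45) - spline(0.05)) / 0.4
    slope2 = (spline(0.95) - spline(0.55)) / 0.4
    eps = 0.1  # assumed flatness threshold (loudness units per unit time)
    if slope1 > eps and slope2 < -eps:
        return "ascending/descending"
    if slope1 > eps:
        return "ascending"
    if slope1 < -eps and slope2 < eps:
        return "descending"
    return "stable"
```

A rising loudness ramp maps to "ascending", a falling ramp to "descending", and a triangle-shaped envelope to "ascending/descending".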
Morphological description was proposed by Pierre Schaeffer. It consists in describing sounds by matching the temporal evolution of their acoustical properties to a set of profiles. This kind of description is especially useful for indexing sounds with unknown cause, such as SoundFX. The present work deals with the automatic estimation of this morphological description from audio signal analysis. Three morphological descriptions are considered: dynamic profiles (ascending, descending, ascending/descending, stable, impulsive), melodic profiles (ascending, descending, fixed, up/down, down/up), and repetition profiles. For each case we present the most appropriate audio features (loudness, pitch, pitch salience, temporal increase/decrease, lag-matrix periodicity, ...) and mapping algorithms (slopes computed from spline approximations of temporal profiles, ...) used to automatically estimate the profiles. We demonstrate the use of these descriptions for automatic indexing (using decision trees) and search-by-similarity of SoundFX.
Deep learning has given AI-based methods for music creation a boost over the past years. An important challenge in this field is to balance user control and autonomy in music generation systems. In this work, we present BassNet, a deep learning model for generating bass guitar tracks based on musical source material. An innovative aspect of our work is that the model is trained to learn a temporally stable two-dimensional latent space variable that offers interactive user control. We empirically show that the model can disentangle bass patterns that require sensitivity to harmony, instrument timbre, and rhythm. An ablation study reveals that this capability is due to the temporal stability constraint on latent space trajectories during training. We also demonstrate that models trained on pop/rock music learn a latent space that offers control over the diatonic characteristics of the output, among other things. Lastly, we present and discuss generated bass tracks for three different music fragments. The work presented here is a step toward the integration of AI-based technology into the workflow of musical content creators.
This article introduces a model called "System & Contrast" (S&C), which aims at describing the inner organization of structural segments within music pieces in terms of: (i) a carrier system, i.e. a sequence of morphological elements forming a multi-dimensional network of self-deducible syntagmatic relationships, and (ii) a contrast, i.e. a substitutive element, usually the last one, which partly departs from the logic implied by the rest of the system. With a primary focus on pop music, the S&C model provides a framework to describe internal implication patterns in musical segments by encoding similarities and relations between its constitutive elements so as to minimize the complexity of the resulting description. It is applicable at several timescales and to a wide variety of musical dimensions in a polymorphous way, therefore offering an attractive meta-description of different types of musical contents. It has been used as a central component in the creation of a set of annotations for 380 pop songs (Bimbot, Sargent, Deruty, Guichaoua & Vincent, 2014). This article formalizes the S&C model, illustrates how it applies to music, and establishes its filiation with Narmour's Implication-Realization model (Narmour, 1990, 1992).
Although the use of AI technology for music production is still in its infancy, it has the potential to make a lasting impact on the way we produce music. In this paper we focus on the design and use of AI music tools for the production of contemporary Popular Music, in particular genres involving studio technology as part of the creative process. First we discuss how music production practices associated with those genres can differ significantly from traditional views of how a musical work is created, and how this affects AI music technology. We argue that, given the role of symbolic representations in this context, as well as the integration of composition activities with editing and mixing, audio-based AI tools are better suited to support the artist's creative workflow than purely piano-roll/MIDI-based tools. Then we report on collaborations with professional artists, in which we look at how various AI tools are used in practice to produce music. We identify usage patterns as well as issues and challenges that arise in practical use of the tools. Based on this we formulate some recommendations and validation criteria for the development of AI technology for contemporary Popular Music.
The tremendous success of rock music in the second half of the 20th century has boosted the sophistication of production and mixing techniques for this music genre. However, there is no unified theory of mixing from the viewpoint of sound engineering. In this paper, we highlight relationships between loudness and spectrum in individual tracks, established during the process of mixing. To do so, we introduce an ad hoc, three-dimensional model of the spectrum of a track. These dimensions are derived from an optimal monitoring level, that is, the level that maximizes the number of frequency bands at the same, maximum loudness. We study a corpus of 55 rock multi-tracks and correlate the model with the loudness of the tracks. We suggest that (1) at high monitoring levels and/or on high-end monitors, track loudness is a linear function of its spectral centroid, and (2) at low monitoring levels and/or on budget monitors, a track's optimal monitoring level is a linear function of its loudness. This indicates that under good listening conditions, human mixers tend to focus on spectral balance, whereas under bad conditions, they favor individual track comprehension. We discuss the implications of our results for automatic mixing.
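The first suggested relationship, track loudness as a linear function of spectral centroid, can be illustrated with a minimal sketch. The helper functions below are hypothetical, not the paper's model; they assume per-track magnitude spectra and loudness values are already available.

```python
import numpy as np

def spectral_centroid(magnitude, freqs):
    """Amplitude-weighted mean frequency of a magnitude spectrum (Hz)."""
    m = np.asarray(magnitude, dtype=float)
    f = np.asarray(freqs, dtype=float)
    return float(np.sum(f * m) / np.sum(m))

def fit_loudness_vs_centroid(centroids, loudness):
    """Least-squares line loudness ~ a * centroid + b across tracks,
    mirroring suggestion (1) for high monitoring levels."""
    a, b = np.polyfit(centroids, loudness, 1)
    return float(a), float(b)
```

Fitting such a line per mix, and inspecting the residuals, is one simple way to probe how strongly a given mix follows the reported tendency.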