2020
DOI: 10.48550/arxiv.2001.04643
Preprint

DDSP: Differentiable Digital Signal Processing

Jesse Engel,
Lamtharn Hantrakul,
Chenjie Gu
et al.

Abstract: Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation…
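The abstract's core idea — implementing classic synthesizer elements as differentiable functions so their parameters can be fit by gradient descent — can be illustrated with a minimal, self-contained sketch. This is plain Python with a hand-derived gradient, not the paper's implementation; `harmonic_synth` and all parameter values are illustrative assumptions:

```python
import math

SR = 16000  # sample rate in Hz (illustrative)
N = 64      # samples per training example (illustrative)

def harmonic_synth(f0, amps, sr=SR, n=N):
    """Additive synthesizer: a sum of harmonics of f0 weighted by amps."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / sr)
            for k, a in enumerate(amps))
        for t in range(n)
    ]

def loss(signal, target):
    """Mean squared error between synthesized and target waveforms."""
    return sum((s - y) ** 2 for s, y in zip(signal, target)) / len(signal)

def grad_amp(f0, amps, target, i, sr=SR, n=N):
    """Analytic dLoss/d(amps[i]).

    The synth output is linear in each amplitude, so the partial
    derivative of each sample with respect to amps[i] is simply the
    i-th harmonic's sinusoid at that sample.
    """
    sig = harmonic_synth(f0, amps, sr, n)
    return sum(
        2 * (sig[t] - target[t])
        * math.sin(2 * math.pi * f0 * (i + 1) * t / sr)
        for t in range(n)
    ) / n

# Fit the harmonic amplitudes by gradient descent toward a target tone.
target = harmonic_synth(440.0, [0.8, 0.3])
amps = [0.0, 0.0]
for _ in range(200):
    grads = [grad_amp(440.0, amps, target, i) for i in range(len(amps))]
    amps = [a - 0.5 * g for a, g in zip(amps, grads)]
# amps now closely approximates the target amplitudes [0.8, 0.3]
```

Auto-differentiation frameworks compute exactly this kind of gradient automatically, which is what lets full DSP chains (oscillators, filters, reverbs) act as trainable layers inside a neural network — the integration difficulty the abstract refers to.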

Cited by 28 publications (47 citation statements)
References 21 publications (31 reference statements)
“…This approach assisted the processing of text in deep learning applications by exploiting the property that embeddings close in vector space encode words with similar meanings. The same approach has been adopted in sound processing to reduce the dimensionality of the signal [40] [22], enhance timbre synthesis [3], or generate a more interpretable representation [41] [42] from which to effectively extract parameters for a synthesizer. In [7] an autoencoder generates a latent representation to condition a WaveNet model, while Dhariwal et al. [43] implemented three separate encoders to generate vectors with different temporal resolutions.…”
Section: Embeddings
confidence: 99%
“…However, in cases where the amount of training data is insufficient, additional data with similar properties can be included by applying conditioning methods. Following these techniques, the generated sound can be conditioned on specific traits such as a speaker's voice [47] [27], independent pitch [3] [48] [36], linguistic features [49] [17], or latent representations [4] [45]. Instead of one-hot embeddings, some implementations have used a confusion matrix to capture a variation of emotions [39], while others provided supplementary positional information for each segment, conditioning music on artist or genre [43].…”
Section: Conditioning Representations
confidence: 99%