2020
DOI: 10.48550/arxiv.2001.04643
Preprint

DDSP: Differentiable Digital Signal Processing

Jesse Engel,
Lamtharn Hantrakul,
Chenjie Gu
et al.

Abstract: Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation…
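The abstract's core idea — implementing classic synthesizer elements as differentiable functions so their parameters can be fit by gradient descent — can be illustrated with a minimal, self-contained sketch. This is plain Python with a hand-derived gradient, not the paper's implementation; `harmonic_synth` and all parameter values are illustrative assumptions:

```python
import math

SR = 16000  # sample rate in Hz (illustrative)
N = 64      # samples per training example (illustrative)

def harmonic_synth(f0, amps, sr=SR, n=N):
    """Additive synthesizer: a sum of harmonics of f0 weighted by amps."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / sr)
            for k, a in enumerate(amps))
        for t in range(n)
    ]

def loss(signal, target):
    """Mean squared error between synthesized and target waveforms."""
    return sum((s - y) ** 2 for s, y in zip(signal, target)) / len(signal)

def grad_amp(f0, amps, target, i, sr=SR, n=N):
    """Analytic dLoss/d(amps[i]).

    The synth output is linear in each amplitude, so the partial
    derivative of each sample with respect to amps[i] is simply the
    i-th harmonic's sinusoid at that sample.
    """
    sig = harmonic_synth(f0, amps, sr, n)
    return sum(
        2 * (sig[t] - target[t])
        * math.sin(2 * math.pi * f0 * (i + 1) * t / sr)
        for t in range(n)
    ) / n

# Fit the harmonic amplitudes by gradient descent toward a target tone.
target = harmonic_synth(440.0, [0.8, 0.3])
amps = [0.0, 0.0]
for _ in range(200):
    grads = [grad_amp(440.0, amps, target, i) for i in range(len(amps))]
    amps = [a - 0.5 * g for a, g in zip(amps, grads)]
# amps now closely approximates the target amplitudes [0.8, 0.3]
```

Auto-differentiation frameworks compute exactly this kind of gradient automatically, which is what lets full DSP chains (oscillators, filters, reverbs) act as trainable layers inside a neural network — the integration difficulty the abstract refers to.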

Cited by 28 publications (47 citation statements)
References 21 publications (31 reference statements)
“…This approach assisted the processing of text in deep learning applications by exploiting the property that embeddings close in vector space encode words with similar meanings. The same approach has been adopted in sound processing to reduce the dimensionality of the signal [40] [22], enhance timbre synthesis [3], or generate a more interpretable representation [41] [42] from which to effectively extract parameters for a synthesizer. In [7] an autoencoder generates a latent representation to condition a WaveNet model, while Dhariwal et al. [43] implemented three separate encoders to generate vectors with different temporal resolutions.…”
Section: Embeddings
confidence: 99%
“…However, in cases where the amount of training data is insufficient, additional data with similar properties can be included by applying conditioning methods. Following these techniques, the generated sound can be conditioned on specific traits such as a speaker's voice [47] [27], independent pitch [3] [48] [36], linguistic features [49] [17], or latent representations [4] [45]. Instead of one-hot embeddings, some implementations have used a confusion matrix to capture a variation of emotions [39], while others provided supplementary positional information for each segment, conditioning music on artist or genre [43].…”
Section: Conditioning Representations
confidence: 99%