2018
DOI: 10.48550/arxiv.1808.02939
Preprint

Towards Learning Fine-Grained Disentangled Representations from Speech

Abstract: Learning disentangled representations of high-dimensional data is currently an active research area. However, compared to the field of computer vision, less work has been done in speech processing. In this paper, we provide a review of two representative efforts on this topic and propose the novel concept of fine-grained disentangled speech representation learning.

Cited by 3 publications (3 citation statements)
References 16 publications

“…This has been applied to disentanglement of image features [10], sensor anonymization [11], voice conversion [12,13] and music translation [14]. This idea has also been proposed for disentangling multiple speech attributes by chaining several autoencoders trained in this fashion [15]. However, this adversarial training adds complexity and potential instability, and the necessary labels are difficult to obtain in the audio domain.…”
Section: Related Work
confidence: 99%
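
To make the pattern in that excerpt concrete, here is a minimal sketch of adversarial attribute removal in an autoencoder latent. It assumes a PyTorch setup; the module names, dimensions, loss weight (0.1), and speaker-ID attribute are illustrative assumptions, not the method of [15] or of the cited paper.

```python
# Minimal sketch: train an autoencoder whose latent is adversarially
# stripped of one attribute (here a toy speaker ID). All names, sizes,
# and the 0.1 weight are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, feat_dim, n_speakers = 64, 80, 10

encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, feat_dim))
# The adversary tries to predict the attribute from the latent; the
# encoder is trained to fool it, so the latent sheds that attribute.
adversary = nn.Linear(latent_dim, n_speakers)

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

x = torch.randn(32, feat_dim)              # a batch of toy frame features
spk = torch.randint(0, n_speakers, (32,))  # attribute labels (needed, as the excerpt notes)

for step in range(100):
    # (1) adversary step: learn to recover the attribute from a frozen latent
    z = encoder(x).detach()
    opt_adv.zero_grad()
    ce(adversary(z), spk).backward()
    opt_adv.step()

    # (2) autoencoder step: reconstruct x while *maximizing* the adversary's loss
    z = encoder(x)
    loss = mse(decoder(z), x) - 0.1 * ce(adversary(z), spk)
    opt_ae.zero_grad()
    loss.backward()
    opt_ae.step()
```

Chaining, as proposed in [15], would repeat this scheme with one autoencoder per attribute; the excerpt's caveat applies at each stage, since every attribute needs labels and adds another potentially unstable min-max game.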
“…Natural speech has very complex manifolds [346] and inherently contains information about the message, gender, age, health status, personality, friendliness, mood, and emotion. All of this information is entangled together [347], and the disentanglement of these attributes in some latent space is a very difficult task that requires extensive training. Most importantly, the training of unsupervised representation learning models is much more difficult than that of supervised ones.…”
Section: A Challenge Of Training Deep Architectures
confidence: 99%
“…phoneme and linguistically irrelevant information like speaker characteristics. In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors such as speaker identity, noise, recording channels, and prosody [22], as well as the linguistic content. Thus, disentanglement will allow learning of salient and robust representations from speech that are essential for applications including speech recognition [64], prosody transfer [77,86], speaker verification [66], speech synthesis [31,77], and voice conversion [32].…”
Section: Learning Disentangled Representation
confidence: 99%
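
As a concrete, hypothetical illustration of that ideal, the sketch below partitions a single latent vector into named factor slices and swaps one slice between utterances, which is how a cleanly disentangled representation would support voice-conversion-style edits. The factor names and sizes are invented for illustration.

```python
# Hypothetical sketch: treat slices of one latent vector as named factors.
# Factor names and sizes are invented; a real model must be *trained* so
# that each slice actually captures only its factor.
import torch

FACTORS = (("content", 32), ("speaker", 16), ("prosody", 16))  # sums to 64

def split_latent(z: torch.Tensor) -> dict:
    """Slice a (batch, 64)-dim latent into per-factor sub-vectors."""
    out, start = {}, 0
    for name, size in FACTORS:
        out[name] = z[:, start:start + size]
        start += size
    return out

z_src = torch.randn(8, 64)  # latents of source utterances (toy values)
z_ref = torch.randn(8, 64)  # latents of reference speakers (toy values)

src, ref = split_latent(z_src), split_latent(z_ref)
# Voice-conversion-style edit: keep content and prosody, swap in the
# reference "speaker" slice, then decode (decoder omitted here).
z_converted = torch.cat([src["content"], ref["speaker"], src["prosody"]], dim=1)
```

The same slicing would support the other applications the excerpt lists, e.g. dropping the noise and channel slices for robust speech recognition, or swapping only the prosody slice for prosody transfer.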