Interspeech 2021
DOI: 10.21437/interspeech.2021-349

Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies

Abstract: Self-supervised speech representations have been shown to be effective in a variety of speech applications. However, existing representation learning methods generally rely on an autoregressive model and/or observed global dependencies while generating the representation. In this work, we propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method, to learn a speech representation in a non-autoregressive manner by relying only on local dependencies of speech. NPC has a conceptually simple objective…

Cited by 44 publications (32 citation statements) · References 17 publications
“…Some studies explore alternatives to masking the input directly. In non-autoregressive predictive coding (NPC) [89], time masking is introduced through masked convolution blocks. Taking inspiration from XLNet [90], it has also been suggested that the input be reconstructed from a shuffled version [91] to address the discrepancy between pre-training and fine-tuning of masking-based approaches.…”
Section: Multi-Target APC, BEST-RQ (mentioning)
confidence: 99%
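To make the masked-convolution idea concrete, here is a minimal PyTorch sketch; it is not the authors' implementation, and the channel count, kernel width, and mask width are illustrative. The key point is a 1D convolution whose kernel is zeroed around its center, so the output at time t cannot see the frames it is later asked to reconstruct.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv1d(nn.Module):
    """1D convolution whose kernel is zeroed around its center, so the
    output at time t never depends on frames within +/- mask_size of t.
    A sketch of the masked-convolution idea; all sizes are illustrative."""

    def __init__(self, channels: int, kernel_size: int, mask_size: int):
        super().__init__()
        assert kernel_size % 2 == 1 and kernel_size > 2 * mask_size
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        # Binary mask over kernel taps: zero out the center window.
        mask = torch.ones(1, 1, kernel_size)
        center = kernel_size // 2
        mask[..., center - mask_size : center + mask_size + 1] = 0.0
        self.register_buffer("mask", mask)  # saved with the module, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); mask the weights, then convolve.
        return F.conv1d(x, self.conv.weight * self.mask, self.conv.bias,
                        padding=self.conv.padding[0])
```

For example, `MaskedConv1d(80, 15, 2)(torch.randn(4, 80, 100))` produces a feature at every step in parallel while hiding each position and the two frames on either side of it, which is what makes such an encoder non-autoregressive yet safe against trivially copying its target.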
“…Transformer encoders and bidirectional RNNs have been considered as context networks for realising MPC. Similarly, the recently proposed Non-Autoregressive Predictive Coding (NPC) [52] also applies a mask to its model input, but it learns representations from the local dependencies of an input sequence rather than from global ones. The MPC approaches can learn effective representations of sequential data in a non-autoregressive way, and hence achieve a considerable speed-up in training.…”
Section: B. SSL Framework (mentioning)
confidence: 99%
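As a rough illustration of the masked-prediction (MPC) objective described above, the following is a hedged sketch; the encoder interface, the function name, and the masking rate are assumptions for illustration, not taken from the cited works.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(encoder, frames, mask_prob=0.15):
    """Minimal MPC sketch: corrupt random frames, encode the corrupted
    input, and reconstruct only the masked positions. `encoder` is
    assumed to map (batch, time, dim) -> (batch, time, dim); the name
    and mask_prob are illustrative, not from any paper."""
    batch, time, _ = frames.shape
    mask = torch.rand(batch, time, device=frames.device) < mask_prob
    corrupted = frames.masked_fill(mask.unsqueeze(-1), 0.0)  # zero masked frames
    predicted = encoder(corrupted)
    # L1 reconstruction on the masked positions only.
    return F.l1_loss(predicted[mask], frames[mask])
```

Because every position is predicted in one forward pass rather than step by step, a loss of this shape trains in a non-autoregressive way, which is the source of the training speed-up the quote mentions.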
“…Generative modeling incorporates language model-like training losses to predict unseen regions (such as future or masked frames), in order to maximize the likelihood of the observed data. Examples include APC [43], VQ-APC [44], Mockingjay [45], TERA [46], and NPC [47]. Discriminative modeling aims to discriminate (or contrast) the target unseen frame from randomly sampled ones, which is equivalent to mutual information maximization.…”
Section: B. Self-Supervised Speech Representation Learning (mentioning)
confidence: 99%
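For the discriminative branch, a minimal InfoNCE-style sketch illustrates the "contrast the target frame with randomly sampled ones" idea; the function name, the use of in-batch negatives, and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, temperature=0.1):
    """Minimal contrastive (InfoNCE) sketch: each context vector must
    score its own target frame above the targets of the other rows,
    which act as negatives. context, targets: (n, dim). Names and the
    temperature value are illustrative assumptions."""
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature        # (n, n) similarity scores
    labels = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy against the diagonal; this objective is a lower
    # bound on the mutual information between context and target.
    return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy maximizes a lower bound on the mutual information between context and target, which is the equivalence the quoted survey refers to.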