Interspeech 2021
DOI: 10.21437/interspeech.2021-755
AISHELL-3: A Multi-Speaker Mandarin TTS Corpus

et al.

Cited by 75 publications (11 citation statements). References: 0 publications.
“…The evaluation was conducted on a test set sampled from a multi-speaker Mandarin speech corpus called AISHELL-3 [24]. The test set contains 4,267 utterances from 44 speakers.…”
Section: Speaker Anonymization Experiments in Mandarin
Mentioning confidence: 99%
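The cited work does not spell out its sampling procedure. As a minimal sketch, assuming AISHELL-3's layout of one subdirectory of WAV files per speaker, a per-speaker random draw yielding roughly 4,267 utterances from 44 speakers could look like the following (the helper name, path handling, and per-speaker count are illustrative assumptions):

```python
import random
from pathlib import Path

def sample_test_set(corpus_root: str, num_speakers: int = 44,
                    utts_per_speaker: int = 97, seed: int = 0) -> list[Path]:
    """Draw a per-speaker random sample from an AISHELL-3-style corpus
    (one subdirectory of WAV files per speaker).

    Hypothetical helper: the cited paper does not document its exact
    sampling procedure; 44 speakers x ~97 utterances ~= 4,267 total.
    """
    rng = random.Random(seed)
    speaker_dirs = [d for d in sorted(Path(corpus_root).iterdir()) if d.is_dir()]
    test_set: list[Path] = []
    for spk in rng.sample(speaker_dirs, num_speakers):
        utts = sorted(spk.glob("*.wav"))
        test_set.extend(rng.sample(utts, min(utts_per_speaker, len(utts))))
    return test_set
```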
“…More importantly, the current VPC 2020 primary baseline B1 requires large amounts of language-specific resources to train a language-dependent ASR acoustic model (AM), while the SSL-based soft content encoder of the proposed method learns universal representations by training on unlabeled audio data, which improves its portability to new languages. Extensive experiments were conducted on the VPC 2020 datasets in English and the AISHELL-3 [24] dataset in Mandarin to demonstrate the effectiveness of our proposed SSL-based language-independent speaker anonymization method.…”
Section: Introduction
Mentioning confidence: 99%
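The soft content encoder described above replaces a language-dependent ASR acoustic model with features from a model pretrained on unlabeled audio. A minimal sketch using torchaudio's pretrained multilingual wav2vec 2.0 bundle follows; the bundle choice, layer index, and helper name are assumptions for illustration, not the cited system's exact configuration:

```python
import torch
import torchaudio

# Multilingual SSL model pretrained on unlabeled audio only, so no
# language-specific transcripts are needed (unlike an ASR AM).
bundle = torchaudio.pipelines.WAV2VEC2_XLSR53
model = bundle.get_model().eval()

@torch.inference_mode()
def extract_content_features(wav_path: str) -> torch.Tensor:
    """Return frame-level SSL features usable as 'soft' content
    representations; the layer index below is an assumption."""
    wav, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    feats, _ = model.extract_features(wav)  # list of per-layer tensors
    return feats[6]                         # (1, num_frames, feature_dim)
```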
“…[22] and Shi et al. [191] introduced an identity feedback constraint by adding an additional loss term between the reference embedding and the embedding extracted from the synthesized signal, thus increasing the robustness and speaker similarity of the produced speech.…”
Section: Multi-Speaker Acoustic Model
Mentioning confidence: 99%
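In code, the identity feedback constraint amounts to re-embedding the synthesized signal with the same speaker encoder and penalizing its distance from the reference embedding. A minimal PyTorch sketch, where `speaker_encoder` stands in for any pretrained speaker-verification model and the cosine form of the loss is an assumption rather than the cited papers' exact choice:

```python
import torch
import torch.nn.functional as F

def identity_feedback_loss(speaker_encoder, reference_wav: torch.Tensor,
                           synthesized_wav: torch.Tensor) -> torch.Tensor:
    """Extra loss term between the reference speaker embedding and the
    embedding re-extracted from the synthesized signal.

    `speaker_encoder` is a placeholder for any pretrained speaker
    verification model mapping a waveform batch to (batch, dim) embeddings.
    """
    e_ref = speaker_encoder(reference_wav)    # (batch, dim)
    e_syn = speaker_encoder(synthesized_wav)  # (batch, dim)
    # Cosine distance; the cited works may use a different metric.
    return 1.0 - F.cosine_similarity(e_ref, e_syn, dim=-1).mean()

# Typically weighted into the overall training objective, e.g.:
# total_loss = mel_loss + lambda_id * identity_feedback_loss(enc, ref, syn)
```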
“…Thanks to advances in deep learning and large-scale, high-quality open-source speech corpora [1][2], computer-generated speech has reached human-like naturalness and high-fidelity audio quality. With a small batch of recorded audio samples, state-of-the-art synthesis systems can generate speech with high similarity to the target speaker that is hard to distinguish from genuine recordings.…”
Section: Introduction
Mentioning confidence: 99%
“…In order to fool the detection systems, we further add a post-processing modification to the generated audio, which causes a slight decay in audio quality but a significant improvement in spoofing. Audio samples are available at our demo page.¹ The rest of this paper is organized as follows: Section 2 introduces our proposed method.…”
Section: Introduction
Mentioning confidence: 99%
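The statement leaves the post-processing unspecified, so the following is only a generic illustration of the idea: a perturbation mild enough to cost little audio quality while changing what a detector sees. Here, low-amplitude Gaussian noise at a fixed SNR; the function and its parameters are hypothetical, not the cited paper's method:

```python
import numpy as np

def postprocess(audio: np.ndarray, snr_db: float = 30.0,
                seed: int = 0) -> np.ndarray:
    """Add low-amplitude Gaussian noise at a fixed signal-to-noise ratio.

    Hypothetical stand-in: the cited paper does not describe its
    modification in this excerpt; this merely illustrates a mild
    perturbation that trades a little quality for changed detector input.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(audio.shape)
    sig_pow = float(np.mean(audio ** 2))
    target_noise_pow = sig_pow / (10.0 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_pow / (np.mean(noise ** 2) + 1e-12))
    return np.clip(audio + noise, -1.0, 1.0)
```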