ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683307
Cycle-consistency Training for End-to-end Speech Recognition

Abstract: This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given tra…

Cited by 66 publications (37 citation statements)
References 24 publications
“…We could consider approximating the expected loss as the sum of the WA losses for a given number of T-F representations obtained by sampling all T-F bins. Back-propagation could then be performed using the policy gradient technique in the REINFORCE algorithm [27], similarly to what was done for automatic speech recognition in [28]. Another option would be to rely on the Gumbel-Softmax trick [29], [30].…”
Section: E Inference Considerations and Expected Loss
confidence: 99%
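The Gumbel-Softmax trick mentioned in this citation statement [29], [30] replaces non-differentiable sampling from a categorical distribution with a continuous, differentiable relaxation: Gumbel noise is added to the logits and a temperature-scaled softmax stands in for argmax. A minimal NumPy sketch of the relaxation itself (the function name and temperature value are illustrative, not taken from the cited works):

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Continuous relaxation of categorical sampling.

    Gumbel noise perturbs the logits; the temperature-scaled softmax
    produces a point on the probability simplex that approaches a
    one-hot sample as temperature -> 0, so gradients can flow through.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Example: a soft "sample" over three classes
soft_sample = gumbel_softmax(np.array([1.0, 2.0, 0.5]), temperature=0.5)
```

At low temperature the output concentrates near a vertex of the simplex, which is why it can replace hard sampling of discrete T-F masks or token choices during backpropagation; the REINFORCE alternative [27] instead keeps hard samples and scores them with the policy-gradient estimator.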
“…Baskar et al. [35] proposed an alternative to backpropagate through discrete variables by using a policy-gradient method, compared to our proposal using a straight-through estimator. Hori et al. [36] replaced TTS with text-to-encoder (TTE) to avoid the need for modeling the speaking style during the reconstruction.…”
Section: Related Work
confidence: 99%
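The straight-through estimator contrasted with the policy-gradient method above keeps a hard, discrete value in the forward pass but pretends the discretization was the identity in the backward pass. In autograd frameworks this is usually written as `x + stop_gradient(round(x) - x)`; the NumPy sketch below can only demonstrate the forward-pass identity, since plain NumPy has no gradient tape (the function name is illustrative):

```python
import numpy as np

def straight_through_round(x):
    """Straight-through rounding, written in the standard identity form.

    Forward: numerically equal to round(x), because the two x terms cancel.
    Backward (in an autograd framework): the (round(x) - x) term would be
    wrapped in stop_gradient, so the gradient of the whole expression is
    the gradient of x alone -- it passes "straight through" the rounding.
    """
    return x + (np.round(x) - x)

hard = straight_through_round(np.array([0.2, 0.7, 1.4]))
```

The design trade-off referenced in the citation statement: the straight-through estimator is biased but low-variance and cheap, whereas REINFORCE-style policy gradients are unbiased but typically need variance reduction to train stably.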
“…Optionally, unpaired speech and text data can be leveraged. • In the low-resource setting, the single-speaker high-quality paired data are reduced to dozens of minutes in TTS [2,12,23,31], while the multi-speaker low-quality paired data are reduced to dozens of hours in ASR [16,32,33,39], compared to the rich-resource setting. Additionally, these methods leverage unpaired speech and text data to ensure performance.…”
Section: Related Work
confidence: 99%