2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017
DOI: 10.1109/asru.2017.8268950
Listening while speaking: Speech chain by deep learning

Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence on each other. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The se…

Cited by 138 publications (99 citation statements) | References 34 publications
“…The term speech chain was first introduced in 1963 (and reissued in 1993) by Denes et al. [8,9], where speech communication was described as a spoken message passing between the minds of the speaker and the listener via the processes of speech production and speech perception. Using this as a basis, Tjandra et al. developed DeepChain [10], an approach that trains recognition and synthesis systems simultaneously. They proposed a sequence-to-sequence model in a closed-loop architecture that allows training with both labeled and unlabeled data.…”
Section: Related Work
confidence: 99%
“…Even though this paper lacks a natural speech dataset, VQ-VAE and the codebook inverter can be applied and have shown strong performance on multi-speaker natural speech [14,13]. Some papers [30,31,32] also show that performance improvements obtained on synthetic datasets carry over to real datasets.…”
Section: Dataset
confidence: 96%
“…The speech chain model [9] is the architecture most similar to ours. As described in Section 1, the ASR model is trained on synthesized speech, and the TTS model is trained on ASR hypotheses for unpaired data.…”
Section: Related Work
confidence: 99%
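The closed-loop training scheme the citation statements describe — supervised updates on paired data, and cross-reconstruction through the other model on unpaired data — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `asr`, `tts`, and `loss_fn` callables are hypothetical stand-ins for real sequence-to-sequence models and their losses.

```python
def train_step(batch, asr, tts, loss_fn):
    """Return the losses for one batch, depending on which modalities it has.

    `batch` is a dict that may contain "speech", "text", or both.
    `asr` maps speech -> text, `tts` maps text -> speech (stand-in callables).
    """
    speech, text = batch.get("speech"), batch.get("text")
    if speech is not None and text is not None:
        # Paired data: ordinary supervised training for both models.
        return {"asr": loss_fn(asr(speech), text),
                "tts": loss_fn(tts(text), speech)}
    if speech is not None:
        # Unpaired speech: ASR produces a hypothesis transcript, TTS tries to
        # reconstruct the original speech from it; the reconstruction error
        # provides a training signal for TTS.
        hypothesis = asr(speech)
        return {"tts": loss_fn(tts(hypothesis), speech)}
    # Unpaired text: TTS synthesizes speech, ASR transcribes it back; the
    # transcription error provides a training signal for ASR.
    synthesized = tts(text)
    return {"asr": loss_fn(asr(synthesized), text)}
```

The key design point is that each unlabeled modality is routed through the opposite model first, so the closed loop turns unpaired data into a self-supervised reconstruction objective.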