2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017
DOI: 10.1109/asru.2017.8268950
Listening while speaking: Speech chain by deep learning

Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence on each other. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The se…

Cited by 138 publications (99 citation statements) | References 34 publications
“…The term speech chain was first introduced in 1963 (and reissued in 1993) by Denes et al. [8,9], where speech communication was described as a spoken message passing between the minds of the speaker and the listener via the processes of speech production and speech perception. Using this as a basis, Tjandra et al. developed DeepChain [10], an approach that trains recognition and synthesis systems simultaneously. They proposed a sequence-to-sequence model in a closed-loop architecture that allows training with both labeled and unlabeled data.…”
Section: Related Work
confidence: 99%
“…Even though this paper lacks a natural speech dataset, VQ-VAE and the codebook inverter can be applied and have shown strong performance on multi-speaker natural speech [14,13]. Some papers [30,31,32] also show that performance improvements obtained on synthetic datasets carry over to real datasets.…”
Section: Dataset
confidence: 96%
“…The speech chain model [9] is the architecture most similar to ours. As described in Section 1, the ASR model is trained on synthesized speech, and the TTS model is trained on ASR hypotheses for unpaired data.…”
Section: Related Work
confidence: 99%
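The closed-loop training scheme the citation statements describe — supervised updates on paired data, and cross-reconstruction through the other model on unpaired data — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `asr`, `tts`, and `loss_fn` callables are hypothetical stand-ins for real sequence-to-sequence models and their losses.

```python
def train_step(batch, asr, tts, loss_fn):
    """Return the losses for one batch, depending on which modalities it has.

    `batch` is a dict that may contain "speech", "text", or both.
    `asr` maps speech -> text, `tts` maps text -> speech (stand-in callables).
    """
    speech, text = batch.get("speech"), batch.get("text")
    if speech is not None and text is not None:
        # Paired data: ordinary supervised training for both models.
        return {"asr": loss_fn(asr(speech), text),
                "tts": loss_fn(tts(text), speech)}
    if speech is not None:
        # Unpaired speech: ASR produces a hypothesis transcript, TTS tries to
        # reconstruct the original speech from it; the reconstruction error
        # provides a training signal for TTS.
        hypothesis = asr(speech)
        return {"tts": loss_fn(tts(hypothesis), speech)}
    # Unpaired text: TTS synthesizes speech, ASR transcribes it back; the
    # transcription error provides a training signal for ASR.
    synthesized = tts(text)
    return {"asr": loss_fn(asr(synthesized), text)}
```

The key design point is that each unlabeled modality is routed through the opposite model first, so the closed loop turns unpaired data into a self-supervised reconstruction objective.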