ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683307
Cycle-consistency Training for End-to-end Speech Recognition

Abstract: This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given tra…

Cited by 66 publications (37 citation statements)
References 24 publications
“…We could consider approximating the expected loss as the sum of the WA losses for a given number of T-F representations obtained by sampling all T-F bins. Back-propagation could then be performed using the policy gradient technique in the REINFORCE algorithm [27], similarly to what was done for automatic speech recognition in [28]. Another option would be to rely on the Gumbel-Softmax trick [29], [30].…”
Section: E Inference Considerations and Expected Loss
confidence: 99%
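The Gumbel-Softmax trick mentioned in this citation statement [29], [30] replaces non-differentiable sampling from a categorical distribution with a continuous, differentiable relaxation: Gumbel noise is added to the logits and a temperature-scaled softmax stands in for argmax. A minimal NumPy sketch of the relaxation itself (the function name and temperature value are illustrative, not taken from the cited works):

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Continuous relaxation of categorical sampling.

    Gumbel noise perturbs the logits; the temperature-scaled softmax
    produces a point on the probability simplex that approaches a
    one-hot sample as temperature -> 0, so gradients can flow through.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Example: a soft "sample" over three classes
soft_sample = gumbel_softmax(np.array([1.0, 2.0, 0.5]), temperature=0.5)
```

At low temperature the output concentrates near a vertex of the simplex, which is why it can replace hard sampling of discrete T-F masks or token choices during backpropagation; the REINFORCE alternative [27] instead keeps hard samples and scores them with the policy-gradient estimator.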
“…Baskar et al. [35] proposed an alternative to backpropagate through discrete variables by using a policy-gradient method, compared to our proposal using a straight-through estimator. Hori et al. [36] replaced TTS with text-to-encoder (TTE) to avoid the need for modeling the speaking style during the reconstruction.…”
Section: Related Work
confidence: 99%
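The straight-through estimator contrasted with the policy-gradient method above keeps a hard, discrete value in the forward pass but pretends the discretization was the identity in the backward pass. In autograd frameworks this is usually written as `x + stop_gradient(round(x) - x)`; the NumPy sketch below can only demonstrate the forward-pass identity, since plain NumPy has no gradient tape (the function name is illustrative):

```python
import numpy as np

def straight_through_round(x):
    """Straight-through rounding, written in the standard identity form.

    Forward: numerically equal to round(x), because the two x terms cancel.
    Backward (in an autograd framework): the (round(x) - x) term would be
    wrapped in stop_gradient, so the gradient of the whole expression is
    the gradient of x alone -- it passes "straight through" the rounding.
    """
    return x + (np.round(x) - x)

hard = straight_through_round(np.array([0.2, 0.7, 1.4]))
```

The design trade-off referenced in the citation statement: the straight-through estimator is biased but low-variance and cheap, whereas REINFORCE-style policy gradients are unbiased but typically need variance reduction to train stably.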
“…Optionally, unpaired speech and text data can be leveraged. • In the low-resource setting, the single-speaker high-quality paired data are reduced to dozens of minutes in TTS [2,12,23,31], while the multi-speaker low-quality paired data are reduced to dozens of hours in ASR [16,32,33,39], compared to the rich-resource setting. Additionally, these methods leverage unpaired speech and text data to ensure performance.…”
Section: Related Work
confidence: 99%