Recent Developments on ESPnet Toolkit Boosted by Conformer

Guo, Pengcheng; Boyer, F.; Chang, Xuankai; Hayashi, Tomoki; Higuchi, Yuki; Inaguma, Hirofumi; Kamo, Naoyuki; Li, Chenda; Garcia‐Romero, Daniel; Shi, Jiatong; Jing, Shi; Watanabe, Shinji; Zhang, Wangyou; Zhang, Yuekai

doi:10.48550/arxiv.2010.13956

Cited by 31 publications

(36 citation statements)

References 22 publications


“…As output labels, 256-word pieces based on Jamo (Korean alphabet) were used. The other model specifications and training strategy can be found in [28].…”

Section: A Experimental Setup (mentioning)

confidence: 99%

Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers

Kim, Lee

2021

Preprint


Recurrent neural network transducers (RNN-T) are a promising end-to-end speech recognition framework that transduces input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model the local-global context information via its convolution and self-attention layers. Although Conformer RNN-T has shown outstanding performance (measured by word error rate (WER) in general), most studies have been verified in the setting where the train and test data are drawn from the same domain. The domain mismatch problem for Conformer RNN-T has not been intensively investigated yet, which is an important issue for the product-level speech recognition system. In this study, we identified that fully connected self-attention layers in the Conformer caused high deletion errors, specifically in the long-form out-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which can exploit local and generalized global information by pruning most of the in-domain fitted global connections. Further, we propose a state reset method for the generalization of the prediction network to cope with long-form utterances. Applying proposed methods to an out-domain test, we obtained 24.6% and 6.5% relative character error rate (CER) reduction compared to the fully connected and local self-attention layer-based Conformers, respectively.


“…where β is the weight that balances the CTC and the CE loss. In the decoding stage, only the probabilities of the decoder and WPCTC loss are combined to obtain the final output [14,23,24]:…”

Section: PM-MMUT: Multi-Modeling Unit Training Fusion with PM Training (mentioning)

confidence: 99%

“…For Uyghur speech recognition task, following our previous setups [14], the experiments use 40 Mel Frequency Cepstral Coefficients (MFCCs) over 25 ms frames with 10 ms stride to each of which cepstral mean and variance normalization (CMVN) is applied. In English tasks, following [24], we use 80-dimensional logmel spectral energies plus 3 extra features for pitch information as acoustic features input. Following [14,24], the trade off weight β was set to 0.3 over all the tasks.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…In English tasks, following [24], we use 80-dimensional logmel spectral energies plus 3 extra features for pitch information as acoustic features input. Following [14,24], the trade off weight β was set to 0.3 over all the tasks. For the E2E configuration, we use a similar setup in our work [14] in the Uyghur ASR experiment, and [24] for the Librispeech task.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…Following [14,24], the trade off weight β was set to 0.3 over all the tasks. For the E2E configuration, we use a similar setup in our work [14] in the Uyghur ASR experiment, and [24] for the Librispeech task. All the E2E models are trained by using ESPnet1 [27] on 4 P40 GPUs for the Uyghur task and 8 M40 GPUs for the English task.…”

Section: Experiments Setup (mentioning)

confidence: 99%


PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition

Ma, Hu, Yolwas, et al. 2021

Preprint


Consonant and vowel reduction are often encountered in Uyghur speech, which might cause performance degradation in Uyghur automatic speech recognition (ASR). Our recently proposed learning strategy based on masking, Phone Masking Training (PMT), alleviates the impact of this phenomenon in Uyghur ASR. Although PMT achieves remarkable improvements, there is still room for further gains due to the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the performance of PMT, we propose a multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT). The idea of the MMUT framework is to split the Encoder into two parts: acoustic feature sequences to phoneme-level representation (AF-to-PLR) and phoneme-level representation to word-piece-level representation (PLR-to-WPLR). It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss to learn the rich phoneme-level context information brought by PMT. Experimental results on Uyghur ASR show that the proposed approaches improve significantly, outperforming pure PMT (reducing WER from 24.0 to 23.7 on Read-Test and from 38.4 to 36.8 on Oral-Test). We also conduct experiments on the 960-hour Librispeech benchmark using ESPnet1, which achieves about 10% relative WER reduction on all the test sets without LM fusion compared with the latest official ESPnet1 pre-trained model.
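
The citation statements for this paper mention two setup details that are easy to illustrate: a trade-off weight β (set to 0.3) that balances the CTC and CE losses, and cepstral mean and variance normalization (CMVN) applied to the acoustic features. The Python sketch below is only an illustration under stated assumptions. The interpolation form β·CTC + (1 − β)·CE is the common hybrid CTC/attention convention and is assumed here, since the quoted text elides the actual equation. The function names `interpolate_losses` and `cmvn` are hypothetical, and the CMVN shown is a plain per-utterance variant, not necessarily the one the authors used.

```python
from math import sqrt
from typing import List


def interpolate_losses(ctc_loss: float, ce_loss: float, beta: float = 0.3) -> float:
    """Weighted sum of two losses.

    Assumption: the objective has the form beta * CTC + (1 - beta) * CE.
    The cited text only says that beta balances the two terms.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * ctc_loss + (1.0 - beta) * ce_loss


def cmvn(frames: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Per-utterance CMVN: each feature dimension is shifted to zero mean
    and scaled to (approximately) unit variance across the frames."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean and population standard deviation per feature dimension.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # eps guards against division by zero for constant dimensions.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]
```

With β = 0.3 the CE term carries most of the weight, so the CTC loss acts as an auxiliary objective in this assumed form.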


What Does Your Face Sound Like? 3D Face Shape towards Voice

Yang, Shan, et al. 2023

AAAI


Face-based speech synthesis provides a practical solution to generate voices from human faces. However, directly using 2D face images leads to the problems of uninterpretability and entanglement. In this paper, to address the issues, we introduce 3D face shape which (1) has an anatomical relationship between voice characteristics, partaking in the "bone conduction" of human timbre production, and (2) is naturally independent of irrelevant factors by excluding the blending process. We devise a three-stage framework to generate speech from 3D face shapes. Fully considering timbre production in anatomical and acquired terms, our framework incorporates three additional relevant attributes including face texture, facial features, and demographics. Experiments and subjective tests demonstrate our method can generate utterances matching faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades the face-voice inference to personalized custom-made voice creating, revealing a promising prospect in virtual human and dubbing applications.
