Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling

Yen, Ming-Chi; Huang, Wen-Chin; Kobayashi, Kenzo; Peng, Yuhuai; Tsai, Shu-Wei; Yu, Tao; Toda, Tomoki; Jang, Jyh-Shing Roger; Wang, Hsin‐Min

doi:10.1109/asru51503.2021.9687908

Cited by 6 publications

(3 citation statements)

References 23 publications


“…To address the challenges of EL speech enhancement, both frame-to-frame and seq-to-seq [48] mapping paradigms can be applied. Seq-to-seq VC models, utilizing an attention-based encoder-decoder architecture [33], can perform representation learning and alignment simultaneously [30], capturing long-term dependencies such as prosody and speaker identity [32]. Some research has demonstrated the potential of using TTS pretraining in conjunction with seq-to-seq modeling for EL speech enhancement [30], [31].…”

Section: B. VC-based Statistical F0 Prediction and Voicing State Control (mentioning)

confidence: 99%

“…Seq-to-seq VC models, utilizing an attention-based encoder-decoder architecture [33], can perform representation learning and alignment simultaneously [30], capturing long-term dependencies such as prosody and speaker identity [32]. Some research has demonstrated the potential of using TTS pretraining in conjunction with seq-to-seq modeling for EL speech enhancement [30], [31]. However, most seq-to-seq models require a substantial amount of high-quality parallel training data.…”

Section: B. VC-based Statistical F0 Prediction and Voicing State Control (mentioning)

confidence: 99%

“…DL-based techniques have demonstrated success in alaryngeal speech enhancement [8], [26]-[29], notably in mapping EL speech spectral features into natural F0 patterns using recurrent neural networks (RNNs) [8], [28]. Furthermore, sequence-to-sequence (seq2seq) modeling for EL to normal speech (EL2SP) conversion has been proposed [30], [31] using text-to-speech (TTS) pretraining [32] and the attention-based encoder-decoder framework [33].…”

Section: Introduction (mentioning)

confidence: 99%
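
The citation statements above describe seq-to-seq VC models as learning the alignment between source and target frames through an attention-based encoder-decoder. As a rough illustration only, a single scaled dot-product attention step might look like the sketch below. The function names, the toy vectors, and the choice of dot-product scoring are assumptions made for this example; they are not taken from the cited papers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """One attention step (hypothetical helper, not the cited implementation).

    Returns the alignment weights over the source frames and the
    context vector, i.e. the weighted sum of the encoder states.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * h for q, h in zip(query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy data: three encoded source frames and one decoder query (assumed values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
weights, context = attend(query, encoder_states)
```

In this toy case the query points along the second dimension, so the second and third source frames receive equal weight and the first receives less. A full seq-to-seq model repeats such a step for every decoder output frame, which is how the alignment is learned jointly with the representation.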


An Investigation of Fundamental Frequency Pattern Prediction for Japanese Electrolaryngeal Speech Enhancement Based on Frame-Wise Phoneme Representations

Eshghi, Toda

2024

IEEE Access

Self Cite


Total laryngectomy (TL) stands as a well-established treatment for advanced laryngeal malignancies, entailing the complete removal of the larynx. Speech rehabilitation following TL is crucial for improving the quality of life (QoL) and facilitating social reintegration. Electrolaryngeal (EL) speech, a widely-used voice restoration technique utilizing external excitation signals, often produces artificial and monotonous sound quality despite enabling patients to form lengthy sentences. Efforts to enhance EL speech include the application of statistical voice conversion (VC) and neural approaches to speech enhancement. These approaches typically aim to map spectral features into acoustic characteristics, including the fundamental frequency (F0). However, challenges arise due to substantial discrepancies and pattern changes between extracted features for EL and normal speech, compounded by limited clinical training data. To address this issue, we explored F0 pattern prediction based on frame-wise phoneme information using bidirectional long short-term memory (BiLSTM) recurrent neural networks. Beyond direct predictions based on phoneme labels, we expanded our analysis to include real-valued phoneme embeddings, and conducted predictions for clustered embeddings representing lower-dimensional input representations. Our findings demonstrate that both regression and classification predictive modeling can map frame-wise phoneme information into natural F0 patterns. Additionally, phoneme labels can be considered as shared features between EL and normal speech, allowing for improved prediction accuracies by incorporating phoneme information from normal speech into the training sets for EL speech. Furthermore, by learning of phoneme embeddings and creating input features for F0 prediction based on the clustering of these embeddings, accurate F0 patterns can be predicted, and the challenge of finding a strategy to reduce the dimensionality of the input features can be effectively alleviated.

INDEX TERMS: Electrolaryngeal speech, fundamental frequency prediction, phoneme labels, phoneme embeddings, speech enhancement.

Following TL, the pharynx is decoupled from the trachea, and inhalation and exhalation occur through an opening in


Alaryngeal Speech Enhancement for Noisy Environments Using a Pareto Denoising Gated LSTM

Maskeliūnas, Damaševičius, Kulikajevas et al.

2024

Journal of Voice


Data Selection Based on Phoneme Affinity Matrix for Electrolarynx Speech Recognition

Hsieh, Wu, Tsa

2023

2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

