“…To address the challenges of EL speech enhancement, both frame-to-frame and seq-to-seq [48] mapping paradigms can be applied. Seq-to-seq VC models, utilizing an attention-based encoder-decoder architecture [33], can perform representation learning and alignment simultaneously [30], capturing long-term dependencies such as prosody and speaker identity [32]. Some research has demonstrated the potential of using TTS pretraining in conjunction with seq-to-seq modeling for EL speech enhancement [30], [31].…”
Section: B. VC-based Statistical F0 Prediction and Voicing State Control (mentioning)
confidence: 99%
“…Seq-to-seq VC models, utilizing an attention-based encoder-decoder architecture [33], can perform representation learning and alignment simultaneously [30], capturing long-term dependencies such as prosody and speaker identity [32]. Some research has demonstrated the potential of using TTS pretraining in conjunction with seq-to-seq modeling for EL speech enhancement [30], [31]. However, most seq-to-seq models require a substantial amount of high-quality parallel training data.…”
Section: B. VC-based Statistical F0 Prediction and Voicing State Control (mentioning)
confidence: 99%
“…DL-based techniques have demonstrated success in alaryngeal speech enhancement [8], [26]-[29], notably in mapping EL speech spectral features into natural F0 patterns using recurrent neural networks (RNNs) [8], [28]. Furthermore, sequence-to-sequence (seq2seq) modeling for EL to normal speech (EL2SP) conversion has been proposed [30], [31] using text-to-speech (TTS) pretraining [32] and the attention-based encoder-decoder framework [33].…”
Total laryngectomy (TL) stands as a well-established treatment for advanced laryngeal malignancies, entailing the complete removal of the larynx. Speech rehabilitation following TL is crucial for improving the quality of life (QoL) and facilitating social reintegration. Electrolaryngeal (EL) speech, a widely used voice restoration technique utilizing external excitation signals, often produces artificial and monotonous sound quality despite enabling patients to form lengthy sentences. Efforts to enhance EL speech include the application of statistical voice conversion (VC) and neural approaches to speech enhancement. These approaches typically aim to map spectral features into acoustic characteristics, including the fundamental frequency (F0). However, challenges arise due to substantial discrepancies and pattern changes between extracted features for EL and normal speech, compounded by limited clinical training data. To address this issue, we explored F0 pattern prediction based on frame-wise phoneme information using bidirectional long short-term memory (BiLSTM) recurrent neural networks. Beyond direct predictions based on phoneme labels, we expanded our analysis to include real-valued phoneme embeddings, and conducted predictions for clustered embeddings representing lower-dimensional input representations. Our findings demonstrate that both regression and classification predictive modeling can map frame-wise phoneme information into natural F0 patterns. Additionally, phoneme labels can be considered as shared features between EL and normal speech, allowing for improved prediction accuracies by incorporating phoneme information from normal speech into the training sets for EL speech.
Furthermore, by learning phoneme embeddings and creating input features for F0 prediction based on the clustering of these embeddings, accurate F0 patterns can be predicted, and the challenge of finding a strategy to reduce the dimensionality of the input features can be effectively alleviated.
INDEX TERMS Electrolaryngeal speech, fundamental frequency prediction, phoneme labels, phoneme embeddings, speech enhancement.
Following TL, the pharynx is decoupled from the trachea, and inhalation and exhalation occur through an opening in
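The abstract above mentions clustering phoneme embeddings to obtain lower-dimensional input features for F0 prediction. The following is only a minimal sketch of that idea, not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means is assumed here, and the 2-D embedding values, the phoneme set, and the helper names (`kmeans`, `frame_features`) are all hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Hypothetical 2-D phoneme embeddings (made-up values, not taken from the paper).
embeddings = {
    "a": (0.90, 0.10), "i": (0.80, 0.20), "u": (0.85, 0.15),
    "s": (0.10, 0.90), "t": (0.20, 0.80), "k": (0.15, 0.95),
}
phonemes = list(embeddings)
cluster_ids = kmeans([embeddings[p] for p in phonemes], k=2)
phoneme_to_cluster = dict(zip(phonemes, cluster_ids))

def frame_features(frame_labels, k=2):
    """One-hot cluster vector per frame, instead of a one-hot over every phoneme."""
    return [[1.0 if phoneme_to_cluster[p] == c else 0.0 for c in range(k)]
            for p in frame_labels]

# Frame-wise phoneme labels for a short hypothetical utterance.
features = frame_features(["s", "s", "a", "a", "a", "t"])
```

In this toy setup a six-way phoneme one-hot shrinks to a two-way cluster one-hot per frame; a sequence model such as a BiLSTM would then consume `features` frame by frame to predict F0, either as a regression target or as a quantized class.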