EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Chen, Yu-Wen; Hung, Kuo-Hsuan; Chuang, Shang-Yi; Sherman, Jonathan H.; Huang, Wen-Chin; Lu, Xugang; Tsao, Yu

doi:10.1109/iscas51556.2021.9401485

Cited by 4 publications

(3 citation statements)

References 29 publications

(39 reference statements)


“…The most commonly used signal is lip reading [17], [18]. Other speech-related biosignals, such as sEMG [3], [6], EMA [4], PMA [5], [8], and ultrasound images [9], [19], have also been reported. In contrast, speech enhancement is designed to improve speech quality and intelligibility in noisy environments, thereby improving the robustness of the system to environmental noise.…”

Section: A Speech Generation and Speech Enhancement

mentioning

confidence: 99%

“…Based on recent advances in machine learning-based technologies, the conversion of biosignals to speech signals has been reported in several studies [3], [4], [5]. Various signals have been considered for speech generation and enhancement, including surface electromyography (sEMG) [3], [6], electromagnetic articulography (EMA) [4], [7], permanent magnetic articulography (PMA) [5], [8], ultrasound tongue imaging [9], [10], Doppler signals [11], [12], visual cues [13], [14], and bone-conducted microphone signals [15]. Further, multimodal learning has been leveraged to integrate information from complementary data, such as text [16], videos [13], bone-conducted microphone signals [15], and articulatory movements [4]. However, the transformation of articulatory movements to facilitate communication has not yet been adequately researched.…”

mentioning

confidence: 99%


EPG2S: Speech Generation and Speech Enhancement Based on Electropalatography and Audio Signals Using Multimodal Learning

Chen, Tsai, et al. 2022

IEEE Signal Process. Lett.

Self Cite


Speech generation and enhancement based on articulatory movements facilitate communication when the scope of verbal communication is absent, e.g., in patients who have lost the ability to speak. Although various techniques have been proposed to this end, electropalatography (EPG), which is a monitoring technique that records contact between the tongue and hard palate during speech, has not been adequately explored. Herein, we propose a novel multimodal EPG-to-speech (EPG2S) system that utilizes EPG and speech signals for speech generation and enhancement. Different fusion strategies based on multiple combinations of EPG and noisy speech signals are examined, and the viability of the proposed method is investigated. Experimental results indicate that EPG2S achieves desirable speech generation outcomes based solely on EPG signals. Further, the addition of noisy speech signals is observed to improve quality and intelligibility. Additionally, EPG2S is observed to achieve high-quality speech enhancement based solely on audio signals, with the addition of EPG signals further improving the performance. The late fusion strategy is deemed to be the most effective approach for simultaneous speech generation and enhancement.
