“…Speech is produced through the temporal overlap of articulatory gestures of the lips, tongue tip, tongue body, tongue dorsum, velum, and larynx, which regulate constrictions in different parts of the vocal tract [1]. Knowledge of articulatory kinematics, together with acoustic information, has shown benefits in various applications such as speech recognition [2,3], speech synthesis [4,5], speaker verification [6], and multimedia applications [7,8,9]. With advances in deep learning techniques, articulatory information has also proven successful in silent speech interfaces (which benefit patients who have lost their voice due to laryngectomy or diseases affecting the vocal folds), for example in speech recognition [10] and in speech synthesis driven by articulatory position information alone [11,12].…”