Silent Speech Interfaces (SSIs), a subfield of speech technology, overcome the limitations of automatic speech recognition when acoustic signals cannot be produced or clearly captured. SSI research focuses on the articulation process of speech production in order to map articulatory data to acoustics. Ultrasound tongue imaging (UTI), a non-invasive, clinically safe technique for viewing the shape, position, and movement of the tongue, has recently become popular for collecting articulatory data on tongue movement. Despite advances in SSI, most related research has been conducted on limited datasets, since additional data are difficult to acquire, and this scarcity leads to overfitting. Data augmentation has already been shown to mitigate overfitting and to improve the generalization ability of deep neural networks. In this paper, we present a preliminary implementation and comparison of data augmentation methods on Azerbaijani ultrasound and speech recordings collected by our team. These methods include consecutive and intermittent time masking, sinusoidal noise injection, and random scaling. We generate new data samples from the dataset using these methods and use mean-squared-error validation loss as the evaluation metric to compare their performance.
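The four augmentation strategies named above can be sketched as follows for a sequence of ultrasound feature frames (time × features). This is a minimal illustrative sketch, not the paper's implementation: the function names, parameter defaults, and the exact mask and noise shapes are assumptions made here for clarity.

```python
import numpy as np

def consecutive_time_mask(frames, max_len=10, rng=None):
    """Zero out one contiguous block of time frames (consecutive masking)."""
    rng = rng or np.random.default_rng()
    out = frames.copy()
    length = int(rng.integers(1, max_len + 1))
    start = int(rng.integers(0, max(1, frames.shape[0] - length)))
    out[start:start + length] = 0.0
    return out

def intermittent_time_mask(frames, n_masks=5, rng=None):
    """Zero out several randomly chosen, non-contiguous time frames."""
    rng = rng or np.random.default_rng()
    out = frames.copy()
    idx = rng.choice(frames.shape[0], size=min(n_masks, frames.shape[0]),
                     replace=False)
    out[idx] = 0.0
    return out

def sinusoidal_noise(frames, amplitude=0.05, freq=0.1, rng=None):
    """Add a random-phase sinusoid along the time axis as structured noise."""
    rng = rng or np.random.default_rng()
    t = np.arange(frames.shape[0])
    phase = rng.uniform(0.0, 2.0 * np.pi)
    noise = amplitude * np.sin(2.0 * np.pi * freq * t + phase)
    return frames + noise[:, None]  # broadcast over the feature axis

def random_scale(frames, low=0.9, high=1.1, rng=None):
    """Multiply the whole sample by a random global gain factor."""
    rng = rng or np.random.default_rng()
    return frames * rng.uniform(low, high)
```

Each transform returns a new sample of the same shape as its input, so augmented copies can be mixed with the original recordings during training without changing the model's input pipeline.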