Increasing speech intelligibility and naturalness in noise based on concepts of modulation spectrum and modulation transfer function

Ngo, Thuan Van; Kubo, Rieko; Akagi, Masato

doi:10.1016/j.specom.2021.09.004

Cited by 5 publications

(2 citation statements)

References 28 publications


“…In this work, we mainly focused on improving TTS speech intelligibility. In related work, it has been suggested that speech intelligibility and naturalness do not always imply each other [52], and thus improvement in intelligibility might not necessarily improve naturalness. In overall, our subjective evaluation results revealed that the proposed systems achieved a significant improvement in speech intelligibility while also preserving speech naturalness.…”

Section: E Subjective Evaluation (mentioning)

confidence: 99%

A Machine Speech Chain Approach for Dynamically Adaptive Lombard TTS in Static and Dynamic Noise Environments

Novitasari, Sakti, Nakamura. 2022. IEEE/ACM Trans. Audio Speech Lang. Process.


Recent end-to-end text-to-speech synthesis (TTS) systems have successfully synthesized high-quality speech. However, TTS speech intelligibility degrades in noisy environments because most of these systems were not designed to handle noisy environments. Several works attempted to address this problem by using offline fine-tuning to adapt their TTS to noisy conditions. Unlike machines, humans never perform offline fine-tuning. Instead, they speak with the Lombard effect in noisy places, where they dynamically adjust their vocal effort to improve the audibility of their speech. This ability is supported by the speech chain mechanism, which involves auditory feedback passing from speech perception to speech production. This paper proposes an alternative approach to TTS in noisy environments that is closer to the human Lombard effect. Specifically, we implement Lombard TTS in a machine speech chain framework to synthesize speech with dynamic adaptation. Our TTS performs adaptation by generating speech utterances based on the auditory feedback that consists of the automatic speech recognition (ASR) loss as the speech intelligibility measure and the speech-to-noise ratio (SNR) prediction as power measurement. Two versions of TTS are investigated: non-incremental TTS with utterance-level feedback and incremental TTS (ITTS) with short-term feedback to reduce the delay without significant performance loss. Furthermore, we evaluate the TTS systems in both static and dynamic noise conditions. Our experimental results show that auditory feedback enhanced the TTS speech intelligibility in noise.


“…The Lombard effect is known to impact the performance of speech recognition systems unfavorably. Various researchers have analyzed Lombard speech produced in different types and levels of noise for speech intelligibility [10, 11, 12], audio and audio-visual speech recognition [13, 14, 15, 16], speaker recognition [17, 18, 19], and emotional speech analysis [20]. Overall, an automatic speech recognition system (ASR) performance may be degraded when Lombard speech is present in the speech signal [15, 16, 21, 22, 23, 24].…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Lombard Speech Using Deep Learning Approach

Kąkol, Korvel, Tamulevičius, et al. 2022. Sensors


Robust detection of Lombard speech in noise is challenging. This study proposes a strategy to detect Lombard speech using a machine learning approach for applications such as public address systems that work in near real time. The paper starts with the background concerning the Lombard effect. Then, the assumptions of the work performed for Lombard speech detection are outlined. The proposed framework combines convolutional neural networks (CNNs) and various two-dimensional (2D) speech signal representations. To reduce the computational cost without abandoning the 2D representation-based approach, a strategy for threshold-based averaging of the Lombard effect detection results is introduced. The pseudocode of the averaging process is also included. A series of experiments is performed to determine the most effective network structure and 2D speech signal representation. Investigations are carried out on German and Polish recordings containing Lombard speech. All 2D speech signal representations are tested with and without augmentation. Augmentation means using the alpha channel to store additional data: gender of the speaker, F0 frequency, and first two MFCCs. The experimental results show that Lombard and neutral speech recordings can be clearly distinguished with high detection accuracy. It is also demonstrated that the proposed speech detection process is capable of working in near real time. These are the key contributions of this work.


Increasing Speech Intelligibility by Mimicking Professional Announcers’ Voices and Its Physical Correlates

Tran, Akagi, Unoki. 2023. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

