Speech Enhancement Algorithm Based on a Convolutional Neural Network Reconstruction of the Temporal Envelope of Speech in Noisy Environments

Soleymanpour, Rahim; Soleymanpour, Mohammad; Brammer, Anthony J.; Johnson, Michael T.; Kim, In-Soo

doi:10.1109/access.2023.3236242

Cited by 11 publications

(7 citation statements)

References 50 publications

“…CNNs have been used to achieve state-of-the-art performance in many image and video analysis tasks, such as object detection, image classification, and video classification. [34–36]…”

Section: Results (mentioning)

confidence: 99%

“Great food, but the noise?”: Relationship between perceived sound quality survey and non acoustical factors in one hotel restaurant in Italy

Loreto, Lori, Serpilli et al. 2023

Building Acoustics

The empirical assessment of the quality of the sound landscape in restaurants and how it affects the overall customer experience has only recently begun to be addressed. Noise from recreational venues such as discos, restaurants, bars, and hotels characterizes the soundscape of tourist centers, where noisy commercial activities must coexist with the right to rest of residents and of guests at accommodation facilities, a right often already compromised by traffic. Although there is a renewed sensitivity to the issue of noise, restaurants are often characterized by poor acoustics and a poor multi-sensory perception of comfort. This article examines the noise levels of a hotel’s dining room in an Italian tourist resort during lunch and dinner hours. The study covered the acoustics, lighting, and quality of the location; to assess perceived comfort, customers were given a satisfaction questionnaire during the meal, every day for a week. Considering the correlation between the three components of the soundscape, namely people, sound, and the environment under consideration, this study aims to explore the effects of the soundscape on sound restoration for accommodation facilities in tourist resorts.

“…CNNs have been used to achieve state-of-the-art performance in many image and video analysis tasks, such as object detection, image classification, and video classification. [34][35][36] Analysis of the questionnaires and the audio track recorded during the stop in the restaurant were used to identify noise sources.…”

Section: Keynotes Of the Restaurant Room (mentioning)

confidence: 99%

“…In Soleymanpour et al (2023), speech enhancement in a single channel was implemented using CNN algorithms for complex noisy speeches to improve the speech quality (Passricha & Aggarwal, 2019) which produces the following result; PESQ = 3.24 (Wang & Wang, 2019; Park & Lee, 2017), CSIG (signal distortion) = 4.34 (Pandey & Wang, 2019; Germain, Chen & Koltun, 2019), CBAK (background noise interference) = 4.10 (Fu et al, 2018; Rownicka, Bell & Renals, 2020), COVL (overall quality of speech) = 3.81 (Rethage, Pons & Serra, 2018), and SSNR (Segmented Signal to Noise Ratio) = 16.85 (Choi et al, 2019). Additionally, CNN was said to be more effective than recursive neural networks (RNNs) (Park & Lee, 2017) and traditional feedforward neural networks (Oord et al, 2016).…”

Section: Research Background (mentioning)

confidence: 99%

CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing

Cherukuru, Mustafa 2024

PeerJ Computer Science

Speech enhancement algorithms are applied in multiple levels of enhancement to improve the quality of speech signals under noisy environments known as multi-channel speech enhancement (MCSE) systems. Numerous existing algorithms are used to filter noise in speech enhancement systems, which are typically employed as a pre-processor to reduce noise and improve speech quality. They may, however, be limited in performing well under low signal-to-noise ratio (SNR) situations. The speech devices are exposed to all kinds of environmental noises which may go up to a high-level frequency of noises. The objective of this research is to conduct a noise reduction experiment for a multi-channel speech enhancement (MCSE) system in stationary and non-stationary environmental noisy situations with varying speech signal SNR levels. The experiments examined the performance of the existing and the proposed MCSE systems for environmental noises in filtering low to high SNRs environmental noises (−10 dB to 20 dB). The experiments were conducted using the AURORA and LibriSpeech datasets, which consist of different types of environmental noises. The existing MCSE (BAV-MCSE) makes use of beamforming, adaptive noise reduction and voice activity detection algorithms (BAV) to filter the noises from speech signals. The proposed MCSE (DWT-CNN-MCSE) system was developed based on discrete wavelet transform (DWT) preprocessing and convolution neural network (CNN) for denoising the input noisy speech signals to improve the performance accuracy. The performance of the existing BAV-MCSE and the proposed DWT-CNN-MCSE were measured using spectrogram analysis and word recognition rate (WRR). It was identified that the existing BAV-MCSE reported the highest WRR at 93.77% for a high SNR (at 20 dB) and 5.64% on average for a low SNR (at −10 dB) for different noises. 
The proposed DWT-CNN-MCSE system has proven to perform well at a low SNR with WRR of 70.55% and the highest improvement (64.91% WRR) at −10 dB SNR.

“…In addition, they retain complex spectral structures in the final speech. Feedforward DNNs (FDNNs) [5]-[9], [13], [16], [17], CNNs [18], [19], RNNs [20], [21], Generative Adversarial Network (GANs) [22], [23], and Transformers [24]-[26] are successful DNN approaches for speech enhancement.…”

Section: Ref# (mentioning)

confidence: 99%

Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement

Saleem, Gunawan, Shafi et al. 2023

IEEE Access

Convolutional encoder-decoder (CED) has emerged as a powerful architecture, particularly in speech enhancement (SE), which aims to improve the quality and intelligibility of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. Usually, CED architectures use a gated recurrent unit (GRU) or long short-term memory (LSTM) as a bottleneck to capture temporal dependencies, enabling an SE model to effectively learn the dynamics and long-term temporal dependencies in the speech signal. However, Transformer neural networks with self-attention effectively capture long-term temporal dependencies. This study proposes a multi-attention bottleneck (MAB) comprised of a self-attention Transformer powered by a time-frequency attention (TFA) module followed by a channel attention module (CAM) to focus on the important features. The proposed bottleneck (MAB) is integrated into a CED architecture and named MAB-CED. The MAB-CED uses an encoder-decoder structure including a shared encoder and two decoders, where one decoder is dedicated to spectral masking and the other is used for spectral mapping. Convolutional Gated Linear Units (ConvGLU) and Deconvolutional Gated Linear Units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are coupled by applying coherent averaging to synthesize the enhanced speech signal. The proposed speech enhancement is examined using two databases, VoiceBank+DEMAND and LibriSpeech. The results show that the proposed speech enhancement outperforms the benchmarks in terms of intelligibility and quality at various input SNRs. The proposed MAB-CED improves the average PESQ by 0.55 (22.85%) with VoiceBank+DEMAND and by 0.58 (23.79%) with LibriSpeech. The average STOI is improved by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.
