Abstract: This study investigates the relationship between the intelligibility and quality of modified speech in noise and in quiet. Speech signals were processed by seven algorithms designed to increase speech intelligibility in noise without altering speech intensity. In three noise maskers, including both stationary and fluctuating noise at two signal-to-noise ratios (SNR), listeners identified keywords from unmodified or modified sentences. The intelligibility performance of each type of speech was measured as the l…
“…Although statistically significant, the improvement in the speech intelligibility metric (i.e., STOI) was not as prominent as in the two speech quality metrics (i.e., SDR and PESQ). This is probably because of ceiling effects; the SNR tested was high overall (starting from 1 dB SNR) and speech intelligibility was not a significant issue in these models of normal hearing (Tang et al, 2017). For the SepFormer model, the acoustic evaluation scores for the unprocessed noisy mixtures (dashed lines) remained the same since the test materials did not change, as shown in Figure 2d–2f.…”
Despite excellent performance in quiet, cochlear implants (CIs) only partially restore normal levels of intelligibility in noisy settings. Recent developments in machine learning have resulted in deep neural network (DNN) models that achieve noteworthy performance in speech enhancement and separation tasks. However, there are no commercially available CI audio processors that utilize DNN models for noise reduction. We implemented two DNN models intended for applications in CIs: (1) a recurrent neural network (RNN), which is a lightweight template model, and (2) SepFormer, which is the current top-performing speech separation model in the literature. The models were trained with a custom training dataset (30 hours) that included four configurations: speech in non-speech noise and speech in 1-talker, 2-talker, and 4-talker speech babble backgrounds. The enhancement of the target speech (or the suppression of the noise) by the models was evaluated by commonly used acoustic evaluation metrics of quality and intelligibility, including (1) signal-to-distortion ratio, (2) “perceptual” evaluation of speech quality, and (3) short-time objective intelligibility. Both DNN models yielded significant improvements in all acoustic metrics tested. The two DNN models were also evaluated with thirteen CI users using two types of background noise: (1) CCITT noise (speech-shaped stationary noise) and (2) 2-talker babble. Significant improvements in speech intelligibility were observed when the noisy speech was processed by the models, compared to the unprocessed conditions. This work serves as a proof of concept for the application of DNN technology in CIs for improved listening experience and speech comprehension in noisy environments.
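To make the signal-to-distortion ratio mentioned in this abstract concrete, here is a minimal sketch of the plain energy-ratio form of SDR in decibels. It is an illustration only: the abstract does not say which SDR variant the authors used, and the function name simple_sdr_db and the synthetic test tone are assumptions introduced here.

```python
import math


def simple_sdr_db(reference, estimate):
    """Energy-ratio SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Hypothetical helper for illustration; not necessarily the SDR
    variant used in the cited study.
    """
    signal_energy = sum(r * r for r in reference)
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if distortion_energy == 0.0:
        return math.inf   # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference, non-zero error
    return 10.0 * math.log10(signal_energy / distortion_energy)


# Synthetic example (assumed data): a 440 Hz tone sampled at 16 kHz,
# and an "estimate" that is 10% too loud.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
estimate = [1.1 * c for c in clean]
print(round(simple_sdr_db(clean, estimate), 2))
```

A 10% amplitude error leaves a distortion energy of 1% of the signal energy, which corresponds to about 20 dB under this definition. Higher values mean the estimate is closer to the clean reference.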
“…It was shown also in [18] that when listening in noise, modification performance on improving intelligibility is more important than its potential negative impact on speech quality. However, when listening in quiet or at SNRs in which intelligibility is no longer an issue to listeners, the impact on speech quality due to modification becomes a concern.…”
Estimates of speech quality and intelligibility for three university classrooms of small, medium and large sizes are presented. The quality and intelligibility of speech were assessed by objective methods using binaural room impulse responses measured at 5-6 points in each room. The measures of speech quality were log-spectral distortion (LSD), bark spectral distortion (BSD) and perceptual evaluation of speech quality (PESQ), and the objective measure of speech intelligibility was the speech transmission index (STI).
Among the quality measures considered, only BSD is shown to be highly correlated with STI for all three classrooms: the correlation coefficient R ranges from −0.6 for the small room to −0.98 for the large room. A close relationship between PESQ and STI is observed only for the large classroom (R = 0.96-0.99), and LSD was found to be uncorrelated with STI for rooms of all sizes. The obtained results can serve as a justification for the use of BSD instead of STI, and vice versa, in the acoustic examination of classrooms of different sizes.
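The correlation coefficients reported above relate a quality measure to STI across the measurement points of a room. A minimal sketch of that computation, assuming the ordinary Pearson coefficient, is given below. The function name pearson_r and the five BSD/STI value pairs are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values at five measurement points in one room
# (invented for illustration, not taken from the paper).
bsd_per_point = [0.12, 0.18, 0.25, 0.31, 0.40]
sti_per_point = [0.78, 0.74, 0.66, 0.63, 0.55]
r = pearson_r(bsd_per_point, sti_per_point)
print(f"R = {r:.2f}")
```

Because BSD is a distortion measure (larger means worse) while a larger STI means better intelligibility, a strong relationship between the two appears as a negative coefficient, consistent with the sign of the values quoted in the abstract.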
“…Another example is the preliminary high-frequency filtering of signals, which allows increasing the efficiency of automatic speech recognition systems [14]. To increase the intelligibility of speech masked by intense noise, it is possible to use algorithms for intentional distortion of speech signals in the time or spectral domain, or in both domains at once [15]. Decreased intelligibility and quality of speech when using speech enhancement algorithms is a known fact [16].…”
In this paper, five objective measures of the quality of speech signals distorted by reverberation are compared with the Speech Transmission Index (STI). The main aim of the comparison is to further test and explain the reasons for the previously discovered phenomenon of an increase in speech quality and intelligibility with increasing room size. The comparison is performed for three university classrooms of small, medium and large sizes. The correlation coefficients between the quality and intelligibility estimates of speech obtained for 5-6 points of each room are estimated. Speech signal quality is assessed using intrusive measures: segmental signal-to-noise ratio (SSNR), log-spectral distortion (LSD), frequency-weighted segmental signal-to-noise ratio (FWSNR), bark spectral distortion (BSD), and perceptual evaluation of speech quality (PESQ). For BSD, high correlation coefficients (0.57-0.99) are determined for rooms of all sizes, and the correlation coefficient is found to increase with room size, which can be explained by a decrease in the density of early sound reflections. For FWSNR, high correlation (0.65-0.98) is determined for medium and large rooms. For PESQ, high correlation (0.96-0.99) is obtained for the large classroom. SSNR and LSD are found to be uncorrelated with STI for rooms of all sizes.
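Of the intrusive measures listed in this abstract, segmental SNR is the simplest to write down: the per-frame SNR in dB is averaged over frames. The sketch below follows that general idea. The frame length of 256 samples, the clamping range of −10 to 35 dB, and the skipping of silent frames are common conventions assumed here, not settings confirmed by the paper, and the function name segmental_snr_db is hypothetical.

```python
import math


def segmental_snr_db(reference, degraded, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNR values (dB), each clamped to [lo, hi].

    frame_len, lo and hi are assumed defaults for illustration only.
    """
    n = min(len(reference), len(degraded))
    frame_values = []
    # Non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        error_energy = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # skip silent reference frames
        if error_energy == 0.0:
            snr = hi  # identical frame: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        frame_values.append(min(hi, max(lo, snr)))
    if not frame_values:
        raise ValueError("no usable frames")
    return sum(frame_values) / len(frame_values)


# Synthetic example (assumed data): a tone and a copy that is 10% too loud.
tone = [math.sin(2 * math.pi * 440 * k / 16000) for k in range(1024)]
louder = [1.1 * s for s in tone]
print(round(segmental_snr_db(tone, louder), 2))
```

Averaging in the dB domain gives quiet frames as much weight as loud ones, which is the main difference from a single whole-signal SNR.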