Role of mask pattern in intelligibility of ideal binary-masked noisy speech

Kjems, Ulrik; Boldt, Jesper B.; Pedersen, Michael; Lunner, Thomas; Wang, DeLiang

doi:10.1121/1.3179673

Cited by 152 publications

(203 citation statements)

References 24 publications

(20 reference statements)

Supporting

Mentioning

187

Contrasting

“…Our method of applying an SNR-dependent binary mask to the target speech resembles the technique of ideal time-frequency segregation (ITFS) that is known from computational auditory scene analysis studies (e.g., Wang 2005; Brungart 2006; Kjems et al 2009). Although these studies used diotic signals and applied Fig.…”

Section: Discussion (mentioning)

confidence: 99%

Intelligibility for Binaural Speech with Discarded Low-SNR Speech Components

Schoenmaker

Par

2016

Advances in Experimental Medicine and Biology

Speech intelligibility in multitalker settings improves when the target speaker is spatially separated from the interfering speakers. A factor that may contribute to this improvement is the improved detectability of target-speech components due to binaural interaction in analogy to the Binaural Masking Level Difference (BMLD). This would allow listeners to hear target speech components within specific time-frequency intervals that have a negative SNR, similar to the improvement in the detectability of a tone in noise when these contain disparate interaural difference cues. To investigate whether these negative-SNR target-speech components indeed contribute to speech intelligibility, a stimulus manipulation was performed where all target components were removed when local SNRs were smaller than a certain criterion value. It can be expected that for sufficiently high criterion values target speech components will be removed that do contribute to speech intelligibility. For spatially separated speakers, assuming that a BMLD-like detection advantage contributes to intelligibility, degradation in intelligibility is expected already at criterion values below 0 dB SNR. However, for collocated speakers it is expected that higher criterion values can be applied without impairing speech intelligibility. Results show that degradation of intelligibility for separated speakers is only seen for criterion values of 0 dB and above, indicating a negligible contribution of a BMLD-like detection advantage in multitalker settings. These results show that the spatial benefit is related to a spatial separation of speech components at positive local SNRs rather than to a BMLD-like detection improvement for speech components at negative local SNRs.
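
The stimulus manipulation described in this abstract (removing target components whose local SNR falls below a criterion) can be illustrated with a minimal sketch. It assumes the target and masker are already available as per-unit energies in a time-frequency grid; the function name `discard_low_snr_target` and the list-of-rows layout are hypothetical and are not taken from the cited study.

```python
import math

def discard_low_snr_target(target_energy, masker_energy, criterion_db):
    """Zero target T-F units whose local SNR is below criterion_db.

    Both inputs are equally shaped lists of rows (assumed layout: one row
    per frequency channel, one value per time frame) of non-negative energies.
    Returns (processed_target_energy, binary_mask).
    """
    processed, mask = [], []
    for t_row, m_row in zip(target_energy, masker_energy):
        p_row, k_row = [], []
        for t, m in zip(t_row, m_row):
            if t <= 0.0:
                snr_db = -math.inf   # no target energy: always discarded
            elif m <= 0.0:
                snr_db = math.inf    # no masker energy: always kept
            else:
                snr_db = 10.0 * math.log10(t / m)
            keep = 1 if snr_db >= criterion_db else 0
            k_row.append(keep)
            p_row.append(t if keep else 0.0)
        processed.append(p_row)
        mask.append(k_row)
    return processed, mask

target = [[4.0, 1.0], [0.5, 0.0]]
masker = [[1.0, 1.0], [1.0, 1.0]]
processed, mask = discard_low_snr_target(target, masker, criterion_db=0.0)
```

Only the target is modified here; the masker would be added back unprocessed, which is what distinguishes this manipulation from masking the full mixture.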

“…The choices made here, with LC about 5 dB smaller than input SNR, were motivated by values shown to be effective for noisy sentences (Brungart et al, 2006; Li and Loizou, 2008; Wang et al, 2009; Kjems et al, 2009). It is possible that LC values that are most effective for consonant materials will differ from those for sentence materials, perhaps due to the increased requirements for acoustic speech information and increased reliance on bottom-up processing.…”

Section: Assessing Benefit (mentioning)

confidence: 99%

Speech-cue transmission by an algorithm to increase consonant recognition in noise for hearing-impaired listeners

Healy

Yoho

Wang

et al. 2014

The Journal of the Acoustical Society of America

Self Cite

Consonant recognition was assessed following extraction of speech from noise using a more efficient version of the speech-segregation algorithm described in Healy, Yoho, Wang, and Wang [(2013) J. Acoust. Soc. Am. 134, 3029-3038]. Substantial increases in recognition were observed following algorithm processing, which were significantly larger for hearing-impaired (HI) than for normal-hearing (NH) listeners in both speech-shaped noise and babble backgrounds. As observed previously for sentence recognition, older HI listeners having access to the algorithm performed as well as or better than young NH listeners in conditions of identical noise. It was also found that the binary masks estimated by the algorithm transmitted speech features to listeners in a fashion highly similar to that of the ideal binary mask (IBM), suggesting that the algorithm is estimating the IBM with substantial accuracy. Further, the speech features associated with voicing, manner of articulation, and place of articulation were all transmitted with relative uniformity and at relatively high levels, indicating that the algorithm and the IBM transmit speech cues without obvious deficiency. Because the current implementation of the algorithm is much more efficient, it should be more amenable to real-time implementation in devices such as hearing aids and cochlear implants.

“…The implementation is based on the observation that the structure and shape of the binary mask patterns is important for both human [15] and machine recognition of speech [26], and that there are similarities between the binary patterns corresponding to a phonetic unit [27]. Our goal is to encode the prior information about the structure of the binary mask corresponding to a BPU in a simple averaged model that can then be used to refine a bottom-up mask estimated using a conventional IBM estimation algorithm.…”

Section: Implementation Using Average Mask Priors (mentioning)

confidence: 99%

“…An element of this vector represents the probability of the frequency channel being speech dominant given the phonetic identity of the time-frame. Since we want the AMPs to be independent of a specific noise condition, they are formed based on the target binary mask (TBM) [15] as opposed to the ideal binary mask. The TBM is defined similar to Eq.…”

Section: Implementation Using Average Mask Priors (mentioning)

confidence: 99%
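
The average mask priors (AMPs) described in the statement above can be read as per-channel frequencies of speech dominance, computed over all frames that share a phonetic label. The sketch below assumes exactly that reading: a plain mean of target-binary-mask frames per label. The function name `average_mask_priors`, the frame layout, and the example labels are hypothetical; the citing paper's actual alignment and labeling procedure is not given in the excerpt.

```python
from collections import defaultdict

def average_mask_priors(tbm_frames, frame_labels):
    """Average binary-mask frames per phonetic label.

    tbm_frames: list of frames, each a list of 0/1 values, one per
        frequency channel (assumed to come from a target binary mask).
    frame_labels: one phonetic label per frame.
    Returns {label: [P(channel is speech dominant | label), ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for frame, label in zip(tbm_frames, frame_labels):
        if label not in sums:
            sums[label] = [0] * len(frame)
        for channel, bit in enumerate(frame):
            sums[label][channel] += bit
        counts[label] += 1
    # Divide each per-channel count by the number of frames with that label.
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

frames = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
labels = ["aa", "aa", "s"]
priors = average_mask_priors(frames, labels)
```

Each resulting vector could then serve as the prior used to refine a bottom-up mask estimate for frames hypothesized to carry that label.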

“…α is a parameter that controls the amount of attenuation to be applied to noise dominant T-F units during resynthesis/feature extraction. Processing noisy signals using the IBM substantially improves intelligibility [15] and robustness of ASR systems [13]. Note that the above definition assumes ideal knowledge; in practice, the IBM has to be estimated directly from the noisy signal.…”

Section: Introduction (mentioning)

confidence: 99%
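
The role of α in the statement above can be sketched as a gain applied to noise-dominant T-F units while speech-dominant units pass unchanged. The excerpt does not show the defining equation, so this is an assumption: α is treated as a linear gain between 0 and 1, and the function name `apply_mask_with_attenuation` is hypothetical.

```python
def apply_mask_with_attenuation(noisy_tf, binary_mask, alpha):
    """Scale noise-dominant T-F units by alpha; keep the rest unchanged.

    noisy_tf: list of rows of T-F magnitudes (or energies) of the mixture.
    binary_mask: same shape, 1 = speech dominant, 0 = noise dominant.
    alpha: assumed linear gain in [0, 1]; 0 removes noise-dominant units.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to be a linear gain in [0, 1]")
    return [
        [x if keep else alpha * x for x, keep in zip(x_row, m_row)]
        for x_row, m_row in zip(noisy_tf, binary_mask)
    ]

noisy = [[2.0, 4.0], [1.0, 3.0]]
mask = [[1, 0], [0, 1]]
softened = apply_mask_with_attenuation(noisy, mask, alpha=0.5)
```

With `alpha=0.0` this reduces to hard binary masking; larger values leave more of the noise-dominant units in the output.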

Coupling binary masking and robust ASR

Narayanan

Wang

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Self Cite

We present a novel framework for performing speech separation and robust automatic speech recognition (ASR) in a unified fashion. Separation is performed by estimating the ideal binary mask (IBM), which identifies speech dominant and noise dominant units in a time-frequency (T-F) representation of the noisy signal. ASR is performed on extracted cepstral features after binary masking. Previous systems perform these steps in a sequential fashion: separation followed by recognition. The proposed framework, which we call bidirectional speech decoding (BSD), unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On the Aurora-4 robust ASR task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.
