Abstract: We propose a novel method for objective speech intelligibility prediction, which can be useful in many application domains such as hearing instruments and forensics. Most objective intelligibility measures available in the literature employ some kind of signal-to-noise ratio (SNR) or a correlation-based comparison between the spectro-temporal representations of clean and processed speech. In this paper, we investigate speech intelligibility prediction from the viewpoint of information theory and introduce n…
“…A typical scale is the ERB (equivalent rectangular bandwidth) scale, e.g., [18], [19]. It is natural, e.g., [15], to consider the auditory-domain signal to have one independent component signal per ERB. Auditory models provide a manner of deriving such component signals.…”
Section: A Model With Production and Interpretation Noise
“…The interpretation process for speech is also noisy: speech signals that are ambiguous in their pronunciation may be interpreted in various ways. Information theoretical concepts have been used in the analysis of human hearing [14] and for the definition of measures of intelligibility [15]. These models do not have the notion of production noise, but the model of [14] considers sensory noise, which corresponds to our interpretation noise.…”
“…These models do not have the notion of production noise, but the model of [14] considers sensory noise, which corresponds to our interpretation noise. The models of [14] and [15] appear not to have been used for optimizing intelligibility.…”
Abstract: We introduce a model of communication that includes noise inherent in the message production process as well as noise inherent in the message interpretation process. The production and interpretation noise processes have a fixed signal-to-noise ratio. The resulting system is a simple but effective model of human communication. The model naturally leads to a method to enhance the intelligibility of speech rendered in a noisy environment. State-of-the-art experimental results confirm the practical value of the model.
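To make the idea of noise with a fixed signal-to-noise ratio concrete, the following is a minimal sketch, not the method of the cited paper. It assumes a unit-variance Gaussian message, independent additive Gaussian noise at every stage, and channel noise specified relative to the message variance. The function name `gaussian_information_bits` and all parameter names are hypothetical. Under these assumptions, the information that reaches the listener saturates at a ceiling set by the production SNR, however clean the environment is.

```python
import math


def gaussian_information_bits(channel_snr, production_snr, interpretation_snr=math.inf):
    """Mutual information (bits per sample) between message and interpreted signal.

    Assumptions (illustrative only): the message is Gaussian with unit variance,
    all noise terms are independent and Gaussian, and every SNR is a linear
    power ratio (not dB).
    """
    # Production noise scales with the message: variance = message variance / SNR.
    production_noise = 1.0 / production_snr
    # Environmental (channel) noise, here expressed relative to the message variance.
    channel_noise = 1.0 / channel_snr
    # Interpretation noise scales with the total power of the received signal.
    received_power = 1.0 + production_noise + channel_noise
    interpretation_noise = received_power / interpretation_snr
    total_noise = production_noise + channel_noise + interpretation_noise
    # Gaussian channel: I = 0.5 * log2(1 + signal power / total noise power).
    return 0.5 * math.log2(1.0 + 1.0 / total_noise)


if __name__ == "__main__":
    for snr_db in (-10, 0, 10, 20, 40):
        channel_snr = 10.0 ** (snr_db / 10.0)
        bits = gaussian_information_bits(channel_snr, production_snr=10.0)
        print(f"channel SNR {snr_db:>4} dB -> {bits:.3f} bits/sample")
```

In this sketch, raising the channel SNR beyond the production SNR yields almost no additional information, which is one way to read the role of production noise in such a model.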
“…Recently, information theory (IT) has been proposed as a new paradigm for speech intelligibility prediction [13,14,15]. This is a natural approach to take given that the fundamental goal of speech communication is to transfer information from a talker to a listener.…”
Instrumental measures of speech intelligibility typically produce an index between 0 and 1 that is monotonically related to listening test scores. As such, these measures are dimensionless and do not represent physical quantities. In this paper, we propose a new instrumental intelligibility metric that describes speech intelligibility using bits per second. The proposed metric builds upon an existing intelligibility metric that was motivated by information theory. Our main contribution is that we use a statistical model of speech communication that accounts for noise inherent in the speech production process. Experiments show that the proposed metric performs at least as well as existing state-of-the-art intelligibility metrics.
“…The STOI measure is based on the sum of the correlation between the envelopes of the clean speech signal and the corrupted speech measured with 15 1/3-octave frequency bands starting at 150 Hz. More recently, using the same frequency bands, it has been shown that a mutual information-based measure can perform better than STOI (Taghia and Martin, 2014).…”
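The band layout and the correlation step described in this statement can be sketched as follows. This is an illustration under stated assumptions, not the reference STOI implementation: it reads "starting at 150 Hz" as the lowest band center frequency, spaces centers by a factor of 2^(1/3), and uses a plain Pearson correlation between two envelope sequences. The function names are hypothetical, and the envelopes would come from a filterbank that is not shown here.

```python
import math


def third_octave_centers(lowest_hz=150.0, num_bands=15):
    """Center frequencies of one-third octave bands, spaced by a factor of 2**(1/3)."""
    return [lowest_hz * 2.0 ** (k / 3.0) for k in range(num_bands)]


def band_edges(center_hz):
    """Lower and upper edge of a one-third octave band around its center frequency."""
    half_step = 2.0 ** (1.0 / 6.0)
    return center_hz / half_step, center_hz * half_step


def envelope_correlation(clean, degraded):
    """Pearson correlation between two equal-length envelope sequences."""
    n = len(clean)
    mean_c = sum(clean) / n
    mean_d = sum(degraded) / n
    cov = sum((c - mean_c) * (d - mean_d) for c, d in zip(clean, degraded))
    norm_c = math.sqrt(sum((c - mean_c) ** 2 for c in clean))
    norm_d = math.sqrt(sum((d - mean_d) ** 2 for d in degraded))
    if norm_c == 0.0 or norm_d == 0.0:
        return 0.0  # A constant envelope carries no correlation information.
    return cov / (norm_c * norm_d)


if __name__ == "__main__":
    for center in third_octave_centers():
        low, high = band_edges(center)
        print(f"center {center:8.1f} Hz, edges {low:8.1f} to {high:8.1f} Hz")
```

A score in the spirit of the description would combine `envelope_correlation` values across the 15 bands and over time; how the actual measure segments, normalizes, and clips the envelopes is not specified in the quoted text and is left out here.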
This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline resulting in reductions in word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech directly recorded in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating various estimated signal properties with typical system performance on a per session and per utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.