We claim that speech analysis algorithms should be based on computational models of human audition, starting at the ears. While much is known about how hearing works, little of this knowledge has been applied in the speech analysis field. We propose models of the inner ear, or cochlea, which are expressed as time- and place-domain signal processing operations; i.e., the models are computational expressions of the important functions of the cochlea. The main parts of the models concern mechanical filtering effects and the mapping of mechanical vibrations into a neural representation. Our model cleanly separates these effects into time-invariant linear filtering, based on a simple cascade/parallel filterbank network of second-order sections, plus transduction and compression, based on half-wave rectification with a nonlinear coupled automatic gain control network. Compared to other speech analysis techniques, this model does a much better job of preserving important detail in both time and frequency, which is important for robust sound analysis. We discuss the ways in which this model differs from more detailed cochlear models.
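As a rough illustration of this processing chain, the sketch below (Python; not from the paper, and with filter coefficients, channel spacing, and gain-control constants invented purely for illustration) cascades second-order sections, half-wave rectifies each channel, and applies a gain control whose state is coupled across neighboring channels:

# A minimal sketch of the model's chain, assuming digital second-order
# sections stand in for the cochlea's mechanical filtering.
import numpy as np
from scipy.signal import lfilter

def cochlea_channels(x, fs, n_channels=32):
    """Cascade of second-order sections; each stage's output is one channel."""
    outputs = []
    y = x
    for k in range(n_channels):
        fc = 8000.0 * (0.9 ** k)   # high-to-low spacing, as along the cochlea
        q = 4.0
        w0 = 2 * np.pi * fc / fs
        r = np.exp(-w0 / (2 * q))  # simple digital resonator (illustrative)
        b, a = [1 - r], [1, -2 * r * np.cos(w0), r * r]
        y = lfilter(b, a, y)
        outputs.append(y)
    return np.array(outputs)

def detect_and_compress(channels, s=0.005, strength=0.9):
    """Half-wave rectification followed by a coupled AGC: each channel's
    gain is driven down by a smoothed loudness estimate shared (coupled)
    across neighboring channels."""
    rect = np.maximum(channels, 0.0)
    state = np.zeros(rect.shape[0])
    out = np.empty_like(rect)
    for t in range(rect.shape[1]):
        gain = 1.0 / (1.0 + strength * state)
        out[:, t] = rect[:, t] * gain
        # Couple channels by averaging each with its neighbors.
        coupled = np.convolve(out[:, t], [0.25, 0.5, 0.25], mode='same')
        state = (1 - s) * state + s * coupled
    return out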
Abstract-An engineered system that hears, such as a speech recognizer, can be designed by modeling the cochlea, or inner ear, and higher levels of the auditory nervous system. To be useful in such a system, a model of the cochlea should incorporate a variety of known effects, such as an asymmetric low-pass/bandpass response at each output channel, a short ringing time, and active adaptation to a wide range of input signal levels. An analog electronic cochlea has been built in CMOS VLSI technology using micropower techniques to achieve this goal of usefulness via realism. The key point of the model and circuit is that a cascade of simple, nearly linear, second-order filter stages with controllable Q parameters suffices to capture the physics of the fluid-dynamic traveling-wave system in the cochlea, including the effects of adaptation and active gain involving the outer hair cells. Measurements on the test chip suggest that the circuit matches both the theory and observations from real cochleas.
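Although the chip itself is analog, the cascade idea can be sketched numerically. The code below is an illustration under assumed stage parameters (not the chip's actual stage design): it shows how the composite response at the end of a cascade of second-order low-pass stages builds up, and how raising the stage Q sharpens the response, which is how the model captures active gain from the outer hair cells.

# Sketch of a cascade of 2nd-order low-pass stages with controllable Q.
# Stage frequencies and Q values are illustrative, not the chip's values.
import numpy as np

def cascade_response(freqs, stage_fcs, q):
    """Magnitude response at the end of the cascade, evaluated at freqs (Hz)."""
    h = np.ones_like(freqs, dtype=complex)
    for fc in stage_fcs:
        s = 1j * freqs / fc              # normalized complex frequency
        h *= 1.0 / (s**2 + s / q + 1.0)  # one 2nd-order low-pass section
    return np.abs(h)

# 50 stages with exponentially decreasing cutoff, as along the cochlea;
# raising q (e.g. from 0.7 toward 1.0) sharpens the composite peak.
freqs = np.linspace(10, 10000, 2000)
stage_fcs = 10000.0 * 0.95 ** np.arange(50)
response = cascade_response(freqs, stage_fcs, q=0.8)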
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large re-recorded noisy and far-field evaluation sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize the high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.
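The PCEN frontend is compact enough to sketch directly. In the NumPy sketch below, a first-order IIR smoother supplies the automatic gain control and a stabilized root replaces log compression; the parameter values are commonly used defaults and should be treated as illustrative starting points rather than the tuned values from any particular system:

# Per-channel energy normalization (PCEN), a minimal sketch.
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """E: filterbank energies, shape (time, frequency)."""
    M = np.empty_like(E)
    M[0] = E[0]  # initialization choice; an assumption of this sketch
    for t in range(1, E.shape[0]):
        # Smoothed energy estimate that drives the automatic gain control.
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    # AGC (divide by smoothed energy), then stabilized root compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

Modeling PCEN as neural network layers then amounts to making parameters such as s, alpha, delta, and r trainable, typically per frequency channel, so they can be optimized jointly with the acoustic model.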
Abstract-Multiple sound signals, such as speech and interfering noises, can be fairly well separated, localized, and interpreted by human listeners with normal binaural hearing. The computational model presented here, based on earlier cochlear modeling work, is a first step toward approaching human levels of performance on the localization and separation tasks. This combination of cochlear and binaural models, implemented as real-time algorithms, could provide the front end for a robust sound interpretation system such as a speech recognizer. The cochlear model used is basically a bandpass filterbank with frequency channels corresponding to places on the basilar membrane; filter outputs are half-wave rectified and amplitude-compressed, maintaining fine time resolution. In the binaural model, outputs of corresponding frequency channels from the two ears are combined by cross-correlation. Peaks in the short-time cross-correlation functions are then interpreted as direction. With appropriate preprocessing, the correlation peaks integrate cues based on signal phase, envelope modulation, onset time, and loudness. Based on peaks in the correlation functions, sources can be recognized, localized, and tracked. Through quickly varying gains, sound fragments are separated into streams representing different sources. Preliminary tests of the algorithms are very encouraging.
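The binaural step alone can be sketched as follows (assuming left and right cochlear channel outputs are already filtered, rectified, and compressed; the lag window and peak picking are simplified relative to the full model). The lag of the cross-correlation peak estimates the interaural time difference, which maps to direction:

# Estimate the interaural time difference (ITD) for one frequency channel.
import numpy as np

def itd_from_channel(left, right, fs, max_lag_s=1e-3):
    """Peak lag of the cross-correlation within +/- max_lag_s seconds."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    # Trimming the edges keeps np.roll's wrap-around out of the dot product.
    xcorr = [np.dot(left[max_lag:-max_lag],
                    np.roll(right, lag)[max_lag:-max_lag]) for lag in lags]
    return lags[int(np.argmax(xcorr))] / fs   # ITD in seconds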
We have implemented a pitch detector based on Licklider's "Duplex Theory" of pitch perception, and tested it on a variety of stimuli from human perceptual tests. We believe that this approach accurately models how humans perceive pitch. We show that it correctly identifies the pitch of complex harmonic and inharmonic stimuli, and that it is robust in the face of noise and phase changes. This perceptual pitch detector combines a cochlear model with a bank of autocorrelators. By performing an independent autocorrelation for each channel, the pitch detector is relatively insensitive to phase changes across channels. The information in the correlogram is filtered, nonlinearly enhanced, and summed across channels. Peaks are identified, and a pitch is then proposed that is consistent with the peaks.

Introduction

This paper describes a pitch detector that mimics the human perceptual system. Traditional approaches base a pitch decision on features of a relatively primitive representation such as the waveform or spectrum. Our pitch detector uses an auditory model. Unlike the simpler techniques, this perceptual technique works for a wide range of pitch effects, and is robust against a wide range of distortions.

The technique used was first proposed by Licklider [1] as a model of pitch perception, but it has not been taken seriously as a computational approach to pitch detection due to its high computational cost.

The representation used by the pitch detector, which corresponds to the output of Licklider's duplex theory, is the correlogram. This representation is unique in its richness, as it shows the spectral content and time structure of a sound on independent axes of an animated display. A pitch detection algorithm analyzes the information in the correlogram and chooses a single best pitch.

There are many signals, such as inharmonic tones or tones in noise, that do not have a periodic time- or frequency-domain structure, yet humans can assign pitches to them. The perceptual pitch detector can handle these difficult cases and is thus more robust when dealing with the common cases. We expect that future systems will benefit by using this approach, or a cost-reduced version of it.

There is still considerable freedom to devise algorithms to reduce the rich correlogram representation to a pitch decision. The results we report are from a relatively simple algorithm, which does not address many of the subtle issues involved in a pitch tracker for use in a real system. Our algorithm picks a pitch for each
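A hedged sketch of the correlogram computation and the simplest possible pitch decision follows: per-channel autocorrelation, summation across channels, and a peak pick within an assumed pitch range. The filtering and nonlinear enhancement steps described above are omitted here.

# Correlogram-based pitch decision, a minimal sketch.
# channels: array (n_channels, n_samples) of rectified cochlear filter
# outputs, e.g. from a filterbank like the one sketched earlier.
import numpy as np

def correlogram_pitch(channels, fs, fmin=60.0, fmax=400.0):
    n = channels.shape[1]
    # Per-channel autocorrelation via FFT (Wiener-Khinchin). Phase within a
    # channel is kept, but phase differences across channels cancel out,
    # which is the source of the method's phase robustness.
    spec = np.fft.rfft(channels, n=2 * n, axis=1)
    ac = np.fft.irfft(spec * np.conj(spec), axis=1)[:, :n]
    summary = ac.sum(axis=0)                 # sum across channels
    lo, hi = int(fs / fmax), int(fs / fmin)  # candidate lag range (samples)
    lag = lo + int(np.argmax(summary[lo:hi]))
    return fs / lag                          # pitch in Hz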