The gammatone filter was imported from auditory physiology to provide a time-domain version of the roex auditory filter and to enable the development of a realistic auditory filterbank for models of auditory perception [Patterson et al., J. Acoust. Soc. Am. 98, 1890-1894 (1995)]. The gammachirp auditory filter was developed to extend the domain of the gammatone auditory filter and to simulate the changes in filter shape that occur with changes in stimulus level. Initially, the gammachirp filter was limited to center frequencies in the 2.0-kHz region, where there were sufficient "notched-noise" masking data to define its parameters accurately. Recently, however, the range of the masking data has been extended in two massive studies. This paper reports how a compressive version of the gammachirp auditory filter was fitted to these new data sets to define the filter parameters over the extended frequency range. The results show that the shape of the filter can be specified over the entire domain of the data (center frequencies from 0.25 to 6.0 kHz and levels from 30 to 80 dB SPL) using just six constants. The compressive gammachirp auditory filter also has the advantage of being consistent with physiological studies of cochlear filtering, insofar as the compression of the filter is mainly limited to the passband and the form of the chirp in the impulse response is largely independent of level.
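As a concrete illustration of the filter family discussed here, the gammachirp impulse response is a gammatone envelope whose carrier chirps as c·ln(t). A minimal sketch follows; the parameter values (n, b, c) are generic textbook defaults, not the six fitted constants reported in the paper:

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth in Hz (Glasberg & Moore, 1990)
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammachirp(fc, fs, n=4, b=1.019, c=-2.0, dur=0.025):
    # Impulse response: t^(n-1) * exp(-2*pi*b*ERB(fc)*t)
    #                   * cos(2*pi*fc*t + c*ln(t)).
    # Setting c = 0 reduces this to the ordinary gammatone filter.
    t = np.arange(1, int(dur * fs) + 1) / fs   # start at t > 0 for ln(t)
    env = t ** (n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
    ir = env * np.cos(2 * np.pi * fc * t + c * np.log(t))
    return ir / np.max(np.abs(ir))             # peak-normalize

ir = gammachirp(fc=2000.0, fs=16000.0)         # 2-kHz channel, 25-ms response
```

The chirp term c·ln(t) is what lets the instantaneous frequency glide through the carrier, producing the asymmetric magnitude response that the plain gammatone lacks.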
Although the rounded-exponential (roex) filter has been successfully used to represent the magnitude response of the auditory filter, recent studies with the roex(p,w,t) filter reveal two serious problems: the fits to notched-noise masking data are somewhat unstable unless the filter is reduced to a physically unrealizable form, and there is no time-domain version of the roex(p,w,t) filter to support modeling of the perception of complex sounds. This paper describes a compressive gammachirp (cGC) filter with the same architecture as the roex(p,w,t) filter that can be implemented in the time domain. The gain and asymmetry of this parallel cGC filter are shown to be comparable to those of the roex(p,w,t) filter, but the fits to masking data are still somewhat unstable. The roex(p,w,t) and parallel cGC filters were also compared with the cascade cGC filter [Patterson et al., J. Acoust. Soc. Am. 114, 1529-1542 (2003)], which was found to provide an equivalent fit with 25% fewer coefficients; moreover, its fits were stable. The advantage of the cascade cGC filter appears to derive from its parsimonious representation of the high-frequency side of the filter. It is concluded that cGC filters offer better prospects than roex filters for the representation of the auditory filter.
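For reference, the roex(p) weighting function that underlies this family of magnitude-response fits can be sketched as below. The slope values are illustrative, and this simple two-slope form omits the w and t tail parameters of the full roex(p,w,t) filter:

```python
import numpy as np

def roex_p(f, fc, pl, pu):
    # roex(p) weighting: W(g) = (1 + p*g) * exp(-p*g),
    # where g = |f - fc| / fc is the normalized deviation from the
    # center frequency, with separate slopes pl (lower side) and
    # pu (upper side) to capture filter asymmetry.
    g = np.abs(f - fc) / fc
    p = np.where(f < fc, pl, pu)
    return (1.0 + p * g) * np.exp(-p * g)

f = np.linspace(500.0, 4000.0, 8)               # sample frequencies in Hz
w = roex_p(f, fc=2000.0, pl=25.0, pu=35.0)      # illustrative slope values
```

The weight equals 1 at the center frequency and decays roughly exponentially on either side, which is why the roex fits a magnitude response well yet has no natural time-domain (impulse-response) counterpart.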
This paper describes a speech-to-singing synthesis system that can synthesize a singing voice given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and comprises three models controlling three acoustic features unique to singing voices: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuation: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of a singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose naturalness is almost the same as that of actual singing voices.
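The overshoot and vibrato components of an F0 control model of this kind can be illustrated with a minimal sketch: a second-order system tracking each note's log-F0 target (which naturally overshoots the target) plus a sinusoidal vibrato term. All parameter values below are illustrative, not those of the proposed system:

```python
import math

def f0_contour(note_hz, dur_s, fs=200, zeta=0.6, wn=40.0,
               vib_rate=6.0, vib_cents=80.0):
    # Sketch of a singing F0 contour: a 2nd-order system tracks the
    # note's target pitch in cents (underdamping produces overshoot),
    # and sinusoidal vibrato is added on top.
    target = 1200.0 * math.log2(note_hz / 440.0)   # target in cents re A4
    x, v = 0.0, 0.0                                # state: cents, cents/s
    dt = 1.0 / fs
    out = []
    for i in range(int(dur_s * fs)):
        a = wn * wn * (target - x) - 2.0 * zeta * wn * v
        v += a * dt                                # explicit Euler update
        x += v * dt
        vib = vib_cents * math.sin(2.0 * math.pi * vib_rate * i * dt)
        out.append(440.0 * 2.0 ** ((x + vib) / 1200.0))
    return out

contour = f0_contour(note_hz=523.25, dur_s=1.0)    # C5 held for one second
```

The damping ratio controls how strongly F0 overshoots the note onset, while the vibrato rate and depth modulate the settled pitch, mirroring two of the four fluctuation types the model controls.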
A basic method for restoring the power envelope from a reverberant signal was proposed by Hirobayashi et al. This method is based on the concept of the modulation transfer function (MTF) and does not require that the impulse response of the environment be measured. However, this basic method has the following problems: (i) how to precisely extract the power envelope from the observed signal; (ii) how to determine the parameters of the impulse response of the room acoustics; and (iii) a lack of consideration of whether the MTF concept can be applied to more realistic signals. This paper improves the basic method with regard to these problems, as a first step towards the development of speech applications. To evaluate the improved method with regard to (i) and (ii), we carried out 1,500 simulations restoring the power envelope from reverberant signals, in which the power envelopes were of three types (sinusoidal, harmonic, and band-limited noise) and the carriers were white noise. We then carried out the same simulations with two types of carrier (white noise or harmonics) with regard to (iii). Our results show that the improved method can adequately restore the power envelope from a reverberant signal and should be applicable to speech envelope restoration.
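The core idea of MTF-based power envelope restoration can be sketched as inverse filtering in the modulation-frequency domain, using Schroeder's MTF for an exponential-decay room response. The regularization floor and the toy signal below are illustrative choices, not part of the published method:

```python
import numpy as np

def restore_power_envelope(env_rev, fs_env, TR):
    # Inverse-MTF filtering sketch: divide the modulation spectrum of
    # the observed power envelope by the MTF of an exponential-decay
    # room response with reverberation time TR (Schroeder's formula:
    # m(fm) = [1 + (2*pi*fm*TR/13.8)^2]^(-1/2)).
    n = len(env_rev)
    fm = np.fft.rfftfreq(n, d=1.0 / fs_env)        # modulation freqs (Hz)
    mtf = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * fm * TR / 13.8) ** 2)
    spec = np.fft.rfft(env_rev) / np.maximum(mtf, 1e-3)  # regularized
    return np.maximum(np.fft.irfft(spec, n), 0.0)  # power must stay >= 0

# Toy example: a 4-Hz sinusoidal power envelope whose modulation depth
# was attenuated by the MTF, then restored by inverse filtering.
fs_env = 200.0
t = np.arange(400) / fs_env
clean = 1.0 + 0.5 * np.sin(2.0 * np.pi * 4.0 * t)
mtf4 = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * 4.0 * 0.5 / 13.8) ** 2)
smeared = 1.0 + 0.5 * mtf4 * np.sin(2.0 * np.pi * 4.0 * t)
restored = restore_power_envelope(smeared, fs_env, TR=0.5)
```

Because reverberation attenuates modulation depth in a way fully determined by TR, the envelope can be restored without measuring the room impulse response, which is exactly the appeal of the MTF approach. Problem (ii) above then amounts to estimating TR itself from the observed signal.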
This paper proposes a new auditory filterbank that enables signal resynthesis from the dynamic representations produced by a level-dependent auditory filterbank. The filterbank is based on a new IIR implementation of the gammachirp, which has been shown to be an excellent candidate for an asymmetric, level-dependent auditory filter. First, the gammachirp filter is shown to decompose into a combination of a gammatone filter and an asymmetric function. The asymmetric function is simulated accurately with a minimum-phase IIR filter, named the "asymmetric compensation filter." Two filterbank structures are then presented, each based on the combination of a gammatone filterbank and a bank of asymmetric compensation filters controlled by a signal-level estimation mechanism. The inverse of the asymmetric compensation filter is always stable because the minimum-phase condition is satisfied. When a bank of inverse filters is applied after the gammachirp analysis filterbank, following the wavelet-transform framework, it is possible to resynthesize signals with small, time-invariant errors and guaranteed precision. This feature has never been accomplished by conventional active auditory filterbanks. The proposed analysis/synthesis gammachirp filterbank is expected to be useful in various applications where human auditory filtering must be modeled.
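Auditory filterbanks of this kind conventionally place their channels uniformly on the ERB-number scale (Glasberg and Moore, 1990). A minimal sketch of that channel spacing, with an illustrative channel count and frequency range, is:

```python
import math

def erb_space(f_lo, f_hi, n):
    # Center frequencies equally spaced on the ERB-number scale,
    # as commonly used for gammatone/gammachirp filterbank channels:
    # ERBnumber(f) = 21.4 * log10(4.37*f/1000 + 1).
    def hz_to_erbn(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)
    def erbn_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    e_lo, e_hi = hz_to_erbn(f_lo), hz_to_erbn(f_hi)
    step = (e_hi - e_lo) / (n - 1)
    return [erbn_to_hz(e_lo + k * step) for k in range(n)]

fcs = erb_space(100.0, 8000.0, 32)   # 32 channels spanning 100 Hz - 8 kHz
```

Uniform ERB spacing gives dense channels at low frequencies and sparse channels at high frequencies, matching the frequency resolution of the cochlea that the gammachirp filterbank models.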
This chapter introduces a state-of-the-art non-blind digital-audio watermarking scheme based on the properties of the human cochlea. The concept is to embed inaudible watermarks into an original sound by controlling its phase characteristics in relation to the characteristics of cochlear delay (CD). Inaudible watermarks are embedded into original signals by applying infinite impulse response (IIR) all-pass filters with CDs, and they are then extracted from the phase difference between the original and watermarked sounds. Results obtained from objective and subjective evaluations and robustness tests revealed that the CD-based approach is considerably more effective in satisfying the requirements for non-blind inaudible watermarking. Embedding limitations of the CD-based approach were investigated through various evaluations, which revealed that these limitations could be mitigated by using parallel, cascade, and composite architectures for the CD filters.
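The building block of such a scheme is a first-order IIR all-pass filter, which alters phase (and hence group delay) while leaving the magnitude spectrum untouched. A minimal sketch of its difference equation follows; the delay parameter value is illustrative, and in a CD scheme two different values of b would encode bits 0 and 1:

```python
def cd_allpass(x, b):
    # First-order IIR all-pass filter H(z) = (-b + z^-1) / (1 - b*z^-1).
    # Difference equation: y[n] = -b*x[n] + x[n-1] + b*y[n-1].
    # The filter is all-pass (|H(e^jw)| = 1), so it changes only the
    # phase of the signal, which is what keeps the watermark inaudible.
    y, x1, y1 = [], 0.0, 0.0
    for xn in x:
        yn = -b * xn + x1 + b * y1
        y.append(yn)
        x1, y1 = xn, yn
    return y

# All-pass sanity check on an impulse: the output must carry unit energy.
impulse = [1.0] + [0.0] * 255
h = cd_allpass(impulse, b=0.795)      # illustrative delay parameter
energy = sum(v * v for v in h)
```

Detection in the non-blind setting then compares the phase spectra of the original and watermarked signals to decide which b (and hence which bit) was applied in each frame.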
We previously proposed an improved method for restoring the power envelope from a reverberant signal, based on the modulation transfer function (MTF) concept, to resolve the problems of Hirobayashi's method. In this paper, to apply the improved method to reverberant speech, we consider three issues related to speech applications: (i) how to apply the improved method to speech dereverberation based on co-modulation characteristics; (ii) whether the MTF concept can also be applied to sub-bands of reverberant signals; and (iii) whether power envelope inverse filtering should be done separately in each channel. We propose an extended filterbank model based on these considerations. We carried out 15,000 simulations of power envelope restoration for reverberant speech signals, and the results show that the proposed model can adequately restore the power envelopes in all channels from reverberant speech. We also found that the reverberation time should be estimated separately in each channel to improve the restoration accuracy of the power envelope.