Interspeech 2017
DOI: 10.21437/interspeech.2017-734

Time-Varying Autoregressions for Speaker Verification in Reverberant Conditions

Abstract: In poor room acoustics conditions, speech signals received by a microphone might become corrupted by the signals' delayed versions that are reflected from the room surfaces (e.g. wall, floor). This phenomenon, reverberation, drops the accuracy of automatic speaker verification systems by causing mismatch between the training and testing. Since reverberation causes temporal smearing to the signal, one way to tackle its effects is to study robust feature extraction, particularly based on long-time temporal featu…

Cited by 6 publications
(14 citation statements)
References 17 publications
“…Robust AR methods have been successfully applied as a preprocessing step in computing mel-frequency cepstral coefficients (MFCCs) within front-ends of automatic speech recognition (ASR) [14] and speaker recognition systems [9,8]: Instead of computing the mel-filter bank energies directly from the Fourier magnitude spectrum, the AR method is used to obtain a noise-robust, parametric estimate of the spectral envelope, from which the filter bank energies are computed. The MFCC representation of the spectral envelope is commonly preferred over alternative representation forms such as mel-filter bank energies or line spectral frequencies [15] because of the statistically convenient properties of the cepstral coefficients [16].…”
Section: Methods
Citation type: mentioning
confidence: 99%
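
The pipeline described in this citation statement (an AR spectral envelope replacing the Fourier magnitude spectrum before the mel filter bank) can be sketched roughly as follows. This is a minimal illustration using plain autocorrelation-method linear prediction, not the robust AR variants of the cited works. The function names, the model order of 20, the 26 filters, the 13 cepstral coefficients, the 512-point FFT and the Hamming window are all assumptions chosen for the example.

```python
import numpy as np


def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin.

    Returns the polynomial a (with a[0] == 1) and the residual energy.
    """
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = to_hz(np.linspace(0.0, to_mel(sample_rate / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def lp_mfcc(frame, sample_rate, order=20, n_fft=512, n_filters=26, n_ceps=13):
    """Cepstral coefficients computed from an LP spectral envelope."""
    windowed = frame * np.hamming(len(frame))
    a, err = lp_coefficients(windowed, order)
    # Parametric power-spectrum envelope: err / |A(e^{jw})|^2.
    envelope = err / np.abs(np.fft.rfft(a, n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ envelope
    log_energies = np.log(energies + 1e-12)
    # DCT-II of the log filter bank energies gives the cepstral coefficients.
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_filters) @ log_energies
```

Calling `lp_mfcc(frame, 16000)` on a 25 ms frame would return a 13-element vector. A robust AR estimator would replace only `lp_coefficients`; the rest of the chain stays the same.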
“…In the FDLP approach, bandpass filtered time-trajectories of speech are modeled with an AR model, and these trajectories are used to compute frequency-domain parametric envelope estimates of speech at given time instants. Both TVLP and FDLP have been successfully applied for robust speech feature extraction with improved results over the baseline frame-based processing [8,9], but their use is limited by the requirement of long macro frames that add algorithmic delay. Furthermore, the modeling of time-trajectories in both TVLP and FDLP is motivated by mathematical convenience, and not by arguing that, for example, the polynomial basis function in TVLP or the autoregressive model in FDLP would be the optimal model for the time-trajectory.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
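
The core idea behind FDLP mentioned in this statement can be illustrated with a short sketch: fitting an AR model to the DCT of a signal segment yields a parametric estimate of the segment's temporal envelope. The example below is a simplified full-band version under stated assumptions. It omits the sub-band decomposition, and the function names and the model order of 20 are hypothetical choices, not taken from the cited papers.

```python
import numpy as np


def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin."""
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def fdlp_envelope(segment, order=20):
    """Full-band temporal envelope estimate, one value per input sample."""
    n = len(segment)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    # DCT-II of the segment: the "frequency-domain" sequence to be predicted.
    dct_coeffs = np.cos(np.pi * k * (t + 0.5) / n) @ segment
    a, err = lp_coefficients(dct_coeffs, order)
    # Evaluating the AR model at angle pi * m / n corresponds to time index m.
    response = np.fft.rfft(a, 2 * n)[:n]
    return err / np.abs(response) ** 2
```

Because the AR model is fitted across frequency rather than across time, its peaks follow energy concentrations in time, which is why a long segment (the "macro frame" mentioned above) is needed.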
“…Despite of these recent technological advancements, the mismatch issues are still a major concern for its real-world applications [10]. The performance of ASV system considerably degrades in presence of mismatch due to intra-speaker variability caused by the variations in speech duration [10,11], background noise [12], vocal effort [13], spoken languages [14], emotion [15], channels [16], room reverberation [17], etc. In this paper, we focus on one of the most important mismatch factor, speech duration, the amount of speech data used in enrollment and verification.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Thus, we find it important to develop and study features that show good performance across a wide variety of settings to make speaker recognition systems less dependent on large amounts of training data from different conditions. To this end, we propose using two recent feature extraction methods [7,1] for whispered speech that have already shown good results in other studies.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…We approach the problem of normal-whisper acoustic mismatch compensation from the viewpoint of robust feature extraction. Since whispered speech is intelligible, yet a * This work contains limited portions of [1]. This is the accepted manuscript of an article published in Speech Communication.…”
Citation type: mentioning
confidence: 99%