Abstract-This document briefly describes the systems submitted by the Center for Robust Speech Systems (CRSS) of The University of Texas at Dallas (UTD) to the 2010 NIST Speaker Recognition Evaluation. Our systems primarily use factor analysis as a feature extractor [1] within a support vector machine (SVM) classification framework. Our main focus in the evaluation is on the telephone trials of the core condition and the 10-second train-test condition. Novel elements in our system include a supervised probabilistic principal component analysis (SPPCA) based approach to factor analysis, and an algorithm for optimal selection of the negative samples used to train the SVM.

I. SYSTEM COMPONENTS

In this section, we describe the specific blocks used to build our systems. Later, we discuss how these parts were joined together to form our sub-systems.

A. Feature Extraction

The acoustic features used in this submission were identical for all the subsystems. A 60-dimensional feature vector (19 MFCCs with log energy + ∆ + ∆∆) is computed using a 25 ms analysis window with a 10 ms shift, and post-processed by feature warping over a 3 s sliding window [2]. To remove silence frames, a Hungarian phoneme recognizer [3] and an energy-based voice activity detection (VAD) method were used. A block diagram of our feature extraction system is shown in Fig. 2.

B. UBM Training

Two gender-dependent UBMs with 1024 mixtures were trained on the NIST 2004, 2005, and 2006 SRE enrollment data. We used the HTK toolkit for training, with 20 iterations per mixture split. These UBMs were later used for factor analysis training and for the joint factor analysis (JFA) based system.

C. Factor Analysis

We used two different modeling approaches for our factor analysis training: probabilistic principal component analysis (PPCA) and supervised probabilistic principal component analysis (SPPCA). For both methods, Switchboard II Phases 2 and 3, Switchboard Cellular Parts 1 and 2, and the NIST 2004, 2005, and 2006 SRE enrollment data were used as the training data. In total, 400 factors were used.

2) SPPCA method: The supervised probabilistic principal component analysis (SPPCA) model [7] is proposed to integrate speaker label information into the PPCA-based factor analysis approach. The latent factor from the proposed model is expected to be more discriminative than the one obtained from the PPCA model. We performed extensive experiments on this model, in combination with different types of intersession compensation techniques in the back-end, for this evaluation.

D. Channel Compensation

We used three different channel compensation techniques; in most cases, they were applied in pairs. They are discussed below.

1) Linear discriminant analysis (LDA): LDA is a common technique for dimensionality reduction and is widely used in pattern recognition applications. The NIST 2004, 2005, and 2006 SRE enrollment data were used as the training data for LDA.

2) Nuisance attribute projection (NAP): The NAP algorithm [8] is used to find a projection matrix intend...
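As a rough illustration of the NAP idea described above (a minimal sketch, not the submission's actual implementation), the following code estimates nuisance directions from the within-speaker scatter of session supervectors and projects them out. The supervector construction, the speaker labels, and the number of removed directions are all assumptions here.

```python
import numpy as np

def train_nap_projection(supervectors, speaker_ids, n_nuisance=40):
    """Estimate a NAP projection that removes the leading
    within-speaker (session/channel) variability directions.

    supervectors : (n_sessions, dim) array of GMM mean supervectors
    speaker_ids  : length-n_sessions array of speaker labels
    n_nuisance   : number of nuisance directions to remove (assumed)
    """
    X = np.asarray(supervectors, dtype=np.float64)
    labels = np.asarray(speaker_ids)

    # Center each speaker's sessions; the residuals span the
    # within-speaker (nuisance) variability.
    residuals = np.empty_like(X)
    for spk in np.unique(labels):
        idx = labels == spk
        residuals[idx] = X[idx] - X[idx].mean(axis=0)

    # The leading right singular vectors of the residual matrix are
    # the top eigenvectors of the within-speaker scatter.
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    U = vt[:n_nuisance].T                      # (dim, n_nuisance)

    # P = I - U U^T projects onto the complement of the nuisance
    # subspace; applied cheaply as x -> x - U (U^T x).
    def project(x):
        return x - U @ (U.T @ x)

    return project
```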
The presence of physical task stress induces changes in the speech production system, which in turn alter speaking behavior. This results in measurable acoustic correlates, including changes to formant center frequencies, breath pause placement, and fundamental frequency. Many of these changes are due to the subject's internal competition between speaking and breathing during the performance of the physical task, which has a corresponding impact on muscle control and airflow within the glottal excitation structure as well as the vocal tract articulatory structure. This study considers the effect of physical task stress on voice quality. Three signal processing-based measures are used to quantify voice quality: (i) the normalized amplitude quotient (NAQ), (ii) the harmonic richness factor (HRF), and (iii) the fundamental frequency. The effects of physical stress on voice quality depend on the speaker as well as the specific task. While some speakers do not exhibit changes in voice quality, a subset exhibits changes in NAQ and HRF measures of similar magnitude to those observed in studies of soft, loud, and pressed speech. For those speakers demonstrating voice quality changes, the observed changes tend toward breathy or soft voicing, as observed in other studies. The effect of physical stress on the fundamental frequency is correlated with its effect on the HRF (r = −0.34) and the NAQ (r = −0.53). Also, the inter-speaker variation in baseline NAQ is significantly higher than the variation in NAQ induced by physical task stress. The results illustrate systematic changes in speech production under physical task stress, which in theory will impact subsequent speech technology such as speech recognition, speaker recognition, and voice diarization systems.
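For concreteness, the sketch below shows one plausible way to compute the two glottal measures named above. It assumes a glottal flow estimate is already available (e.g., from inverse filtering) and uses simplified peak and harmonic picking, so it should be read as an illustration rather than the study's measurement code.

```python
import numpy as np

def naq(glottal_flow, f0, fs):
    """Normalized amplitude quotient for one glottal cycle.

    glottal_flow : flow samples for a single cycle (estimation
                   method, e.g. inverse filtering, is assumed)
    f0           : fundamental frequency in Hz
    fs           : sampling rate in Hz
    """
    flow = np.asarray(glottal_flow, dtype=np.float64)
    t0 = 1.0 / f0                          # period length in seconds
    f_ac = flow.max() - flow.min()         # peak-to-peak flow amplitude
    d_peak = -np.min(np.diff(flow) * fs)   # magnitude of the negative
                                           # peak of the flow derivative
    return f_ac / (d_peak * t0)            # dimensionless quotient

def hrf(frame, f0, fs):
    """Harmonic richness factor (dB): summed harmonic amplitudes
    above the fundamental relative to the fundamental's amplitude."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def amp(f):                            # amplitude at nearest FFT bin
        return spec[np.argmin(np.abs(freqs - f))]

    harmonics = [amp(k * f0) for k in range(2, int((fs / 2) // f0) + 1)]
    return 20.0 * np.log10(np.sum(harmonics) / amp(f0))
```

Lower NAQ and higher HRF are generally associated with pressed voicing, and the opposite pattern with breathy or soft voicing, which is why these two measures are paired in the analysis above.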
Physical task stress is known to affect the fundamental frequency and other measurements of the speech signal. A corpus of physical task stress speech is analyzed using a spectrum F-ratio and frame score distribution divergences. The measurements differ between phone classes, and are greater for vowels and nasals than for plosives and fricatives. In further analysis, frame score distribution divergences are used to measure the spectral dissimilarity between neutral and physical task stress speech. Frame scores are the log likelihood ratios between Gaussian mixture models (GMMs) of physical task stress and of neutral speech. Mel-frequency cepstral coefficients are used as the acoustic feature inputs to the GMMs. A Laplacian distribution is fitted to the frame scores for each of ten phone classes, and the symmetric Kullback-Leibler divergence is employed to measure the change in distribution from neutral to physical task stress. The results suggest that the spectral dissimilarity is greatest for the second level of a four level exertion measurement, and that spectral dissimilarity is greater for nasal phones than for plosives and fricatives. Further, the results suggest that different phone classes are affected differently by physical task stress.
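The divergence pipeline described above maps onto a short computation: per-frame GMM log likelihood ratios, maximum likelihood Laplace fits, and the closed-form KL divergence between Laplace densities. The sketch below assumes pre-trained scikit-learn GMMs and MFCC feature matrices; the names and training details are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_scores(mfcc, gmm_stress, gmm_neutral):
    """Per-frame log likelihood ratio between the physical task
    stress GMM and the neutral GMM (both assumed already trained)."""
    return gmm_stress.score_samples(mfcc) - gmm_neutral.score_samples(mfcc)

def fit_laplace(scores):
    """ML estimates of the Laplace location (median) and scale
    (mean absolute deviation from the median)."""
    mu = np.median(scores)
    b = np.mean(np.abs(scores - mu))
    return mu, b

def kl_laplace(mu1, b1, mu2, b2):
    """Closed-form KL divergence between two Laplace densities."""
    d = abs(mu1 - mu2)
    return np.log(b2 / b1) + d / b2 + (b1 / b2) * np.exp(-d / b1) - 1.0

def symmetric_kl(scores_a, scores_b):
    """Symmetric KL divergence between Laplace fits of two sets of
    frame scores (e.g. one phone class, neutral vs. task stress)."""
    (m1, b1), (m2, b2) = fit_laplace(scores_a), fit_laplace(scores_b)
    return kl_laplace(m1, b1, m2, b2) + kl_laplace(m2, b2, m1, b1)
```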
Common acoustic sources, like voices or musical instruments, exhibit strong frequency and directional dependence. When transported through complex environments, their anisotropic radiated field undergoes scattering, diffraction, and occlusion before reaching a directionally-sensitive listener. We present the first wave-based interactive auralization system that encodes and renders a complete reciprocal description of acoustic wave fields in general scenes. Our method renders directional effects at freely moving and rotating sources and listeners and supports any tabulated source directivity function and head-related transfer function. We represent a static scene's global acoustic transfer as an 11-dimensional bidirectional impulse response (BIR) field, which we extract from a set of wave simulations. We parametrically encode the BIR as a pair of radiating and arriving directions for the perceptually-salient initial (direct) response, and a compact 6 × 6 reflections transfer matrix capturing indirect energy transfer with scene-dependent anisotropy. We render our encoded data with an efficient and scalable algorithm, integrated in the Unreal Engine™, whose CPU performance is agnostic to scene complexity and angular source/listener resolutions. We demonstrate convincing effects that depend on detailed scene geometry, for a variety of environments and source types.
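As a loose illustration of how a 6 × 6 reflections transfer matrix of this kind can be applied at render time, the minimal sketch below multiplies coarsely binned source energy through the matrix and weights the result by a listener's directional gains. The axial binning and the energy-domain combination are our assumptions for illustration, not the paper's renderer.

```python
import numpy as np

def indirect_energy(T, source_dir_energy, listener_dir_gain):
    """Sketch of applying a 6x6 reflections transfer matrix.

    T                 : (6, 6) matrix; entry T[j, i] is the indirect
                        energy transferred from emission direction i
                        at the source to arrival direction j at the
                        listener (directions assumed axial: +/-X, Y, Z)
    source_dir_energy : (6,) radiated source energy binned into the
                        six coarse directions (binning assumed)
    listener_dir_gain : (6,) listener sensitivity per arrival
                        direction (e.g. derived from an HRTF)
    """
    arriving = T @ source_dir_energy        # energy per arrival direction
    return float(listener_dir_gain @ arriving)
```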