Speaker and noise factorisation on the AURORA4 task

Wang, Y.-Q.; Gales, Mark J. F.

doi:10.1109/icassp.2011.5947375

Cited by 21 publications

(13 citation statements)

References 11 publications

(15 reference statements)


“…The configuration for estimating VTS transforms was the same as that used for the AURORA-4 task in [19]. For reference the performance on the original AURORA 4 data for these three test sets were 6.9% (01), 19.5% (04) and 11.8% (08).…”

Section: Aurora-4 Task (mentioning)

confidence: 99%

Model-based approaches to handling additive noise in reverberant environments

Gales

Wang

2011

2011 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays

Self Cite


Model-based approaches to handle additive and convolutional noise have been extensively investigated and used. However, the application of these approaches to handling reverberant noise has received less attention. This paper examines the extension of two standard adaptation/compensation approaches to handling reverberant noise. The first is an extension of vector Taylor series (VTS) compensation, reverberant VTS, where a mismatch function representing reverberant noise is used. The second scheme modifies constrained MLLR to allow a wide-span of frames to be taken into account and "projected" into the required dimensionality. To allow additive noise to be handled, both these schemes are combined with standard VTS. The approaches are evaluated and compared on two tasks, MC-WSJ-AV, and a reverberant simulated version of AURORA-4.
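For orientation, the mismatch function commonly used in standard VTS compensation can be written as below, for static cepstral features with the phase term ignored. The symbols are assumptions introduced here, not taken from the abstract: y is the noisy speech, x the clean speech, h the convolutional noise, n the additive noise, and C the DCT matrix. The reverberant extension proposed in the paper is not reproduced here.

```latex
% Standard VTS mismatch function (static cepstra, phase term ignored).
% log and exp act element-wise; C^{-1} denotes the (pseudo-)inverse DCT.
\[
  \mathbf{y} = \mathbf{x} + \mathbf{h}
    + \mathbf{C}\,\log\!\left(\mathbf{1}
    + \exp\!\left(\mathbf{C}^{-1}\left(\mathbf{n} - \mathbf{x} - \mathbf{h}\right)\right)\right)
\]
```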



“…This is not optimal if considering the nonlinear nature of the mismatch function relating the clean speech and the noisy speech. In a recent work [9], two combination schemes of MLLR and VTS are considered. One combination called "VTS+MLLR" conducts MLLR on top of the standard VTS.…”

Section: Introduction (mentioning)

confidence: 99%

“…The "Joint" scheme replaces the clean speech model used in the VTS with a speaker-adapted clean speech model by MLLR transform. It is discovered that the speaker's MLLR transform estimated from the noisy speech using the "Joint" scheme still models some of the limitations of the VTS mismatch function [9], i.e. carries information about current noise characteristics.…”

Section: Introduction (mentioning)

confidence: 99%

Combining eigenvoice speaker modeling and VTS-based environment compensation for robust speech recognition

Deng

2012

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Eigenvoice and vector Taylor series (VTS) are good models for speaker differences and environmental variations separately. However, speaker and environmental variation always coexist in real-world speech. In this paper, we propose to combine eigenvoice and VTS. Specifically, we introduce eigenvoice speaker modeling for the clean speech into VTS's nonlinear mismatch function. In contrast, the standard VTS uses speaker-independent modeling to represent the clean speech, regardless of speaker differences. The eigenvoice coefficients and the noise model parameters are jointly estimated in the new approach. Experimental results on the Aurora2 task show the improved performances of combining eigenvoice and VTS and demonstrate its ability for speaker and noise factorization.


“…More recently, a series of studies has been developed, in which speaker and background noise effects are separately characterized using specific transforms. Well-known methods include factorized adaptation [29] and acoustic factorization algorithms [30], [31].…”

Section: Introduction (mentioning)

confidence: 99%

A MAP-based Online Estimation Approach to Ensemble Speaker and Speaking Environment Modeling

Tsao

Matsuda

Hori

et al. 2014

IEEE/ACM Trans. Audio Speech Lang. Process.


An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. This ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve the speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of priors are derived; they are the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP) densities. Since each prior density is able to characterize specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Based on the experimental results on the Aurora-2 task, we verify that using the MAP-based online mapping function estimation can enable ESSEM to achieve better performance than using the maximum-likelihood (ML) based counterpart. Moreover, by using an integration of the online environment structuring adaptation and mapping function estimation, the proposed MAP-based ESSEM framework is found to provide the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average word error rate reduction of 15.53% (5.41 to 4.57%) under 50 testing conditions at a signal-to-noise ratio (SNR) of 0 to 20 dB over the three standardized testing sets.

Index Terms: Ensemble speaker and speaking environment modeling, ESSEM, MAP, noise robustness.
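The 15.53% figure quoted in the abstract reads as a relative reduction between the two word error rates given in parentheses. A minimal sketch of that arithmetic follows; the function name `relative_reduction` is hypothetical and introduced here only for illustration.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100.0


# Word error rates quoted in the abstract: 5.41% (baseline) and 4.57% (MAP-based ESSEM).
reduction = relative_reduction(5.41, 4.57)
print(f"{reduction:.2f}%")  # prints "15.53%"
```

The absolute difference is 0.84 percentage points, so the quoted figure is relative to the baseline, not absolute.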

