The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments

Lincoln, Mike; McCowan, Iain; Vepa, Jithendra; Maganti, Hari Krishna

doi:10.1109/asru.2005.1566470

Cited by 135 publications

(132 citation statements)

References 8 publications

“…The first used the single distant microphone data from the MC-WSJ-CAM0 task [15]; here the data were recorded in a "real" environment. The second task, based on the AURORA4 data [16], had additive and reverberant noise artificially added.…”

Section: Results (mentioning)

confidence: 99%

Model-based approaches to handling additive noise in reverberant environments

Gales

Wang

2011

2011 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays

Model-based approaches to handle additive and convolutional noise have been extensively investigated and used. However, the application of these approaches to handling reverberant noise has received less attention. This paper examines the extension of two standard adaptation/compensation approaches to handling reverberant noise. The first is an extension of vector Taylor series (VTS) compensation, reverberant VTS, where a mismatch function representing reverberant noise is used. The second scheme modifies constrained MLLR to allow a wide-span of frames to be taken into account and "projected" into the required dimensionality. To allow additive noise to be handled, both these schemes are combined with standard VTS. The approaches are evaluated and compared on two tasks, MC-WSJ-AV, and a reverberant simulated version of AURORA-4.

“…The MC-WSJ-AV data [15] is divided into one development set (dev1) and two evaluation sets (evl1 and evl2) for each of three conditions. Only one of the conditions single speaker stationary was used in these evaluations.…”

Section: MC-WSJ-AV Task (mentioning)

confidence: 99%

“…The speech utterances are taken from the Wall Street Journal (WSJ) corpus (Lincoln et al., 2005). This database provides a broad phonetic space for speech separation evaluation.…”

Section: Acoustic and Analysis Setup (mentioning)

confidence: 99%

Computational methods for underdetermined convolutive speech localization and separation via model-based sparse component analysis

Asaei

Bourlard

Taghizadeh

et al. 2016

Speech Communication

In this paper, the problem of speech source localization and separation from recordings of convolutive underdetermined mixtures is studied. The problem is cast as recovering the spatio-spectral speech information embedded in a microphone array compressed measurements of the acoustic field. A model-based sparse component analysis framework is formulated for sparse reconstruction of the speech spectra in a reverberant acoustic resulting in joint localization and separation of the individual sources. We compare and contrast the computational approaches to model-based sparse recovery exploiting spatial sparsity as well as spectral structures underlying spectrographic representation of speech signals. In this context, we explore identification of the sparsity structures at the auditory and acoustic representation spaces. The auditory structures are formulated upon the principles of structural grouping based on proximity, autoregressive correlation and harmonicity of the spectral coefficients and they are incorporated for sparse reconstruction. The acoustic structures are formulated upon the image model of multipath propagation and they are exploited to characterize the compressive measurement matrix associated with microphone array recordings. Three approaches to sparse recovery relying on combinatorial optimization, convex relaxation and Bayesian methods are studied and evaluated based on thorough experiments. The sparse Bayesian learning method is shown to yield better perceptual quality while the interference suppression is also achieved using the combinatorial approach with the advantage of offering the most efficient computational cost. Furthermore, it is demonstrated that an average autoregressive model can be learned for speech localization and exploiting the proximity structure in the form of block sparse coefficients enables accurate localization. Throughout the extensive empirical evaluation, we confirm that a large and random placement of the microphones enables significant improvement in source localization and separation performance.

“…We performed far-field automatic speech recognition experiments on the PASCAL Speech Separation Challenge 2 (SSC2) [11] corpus. The data contain recordings of two speakers simultaneously and the utterances are from the 5,000 word vocabulary Wall Street Journal (WSJ) task.…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…Therefore, we propose to first separate the target speech and the interfering speech using MMI beamforming techniques, followed by a Zelinski and binary-masking based postfilter, and then to perform the mapping method for estimating the MFCCs of the clean speech. Our studies on the PASCAL SSC2 corpus [11] show the effectiveness of the proposed methods.…”

Section: Introduction (mentioning)

confidence: 95%

A Neural Network Based Regression Approach for Recognizing Simultaneous Speech

Liu

Kumatani

Dines

et al.

Machine Learning for Multimodal Interaction

Abstract. This paper presents our approach for automatic speech recognition (ASR) of overlapping speech. Our system consists of two principal components: a speech separation component and a feature estimation component. In the speech separation phase, we first estimated the speaker's position, and then the speaker location information is used in a GSC-configured beamformer with a minimum mutual information (MMI) criterion, followed by a Zelinski and binary-masking postfilter, to separate the speech of different speakers. In the feature estimation phase, the neural networks are trained to learn the mapping from the features extracted from the pre-separated speech to those extracted from the close-talking microphone speech signal. The outputs of the neural networks are then used to generate acoustic features, which are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through ASR experiments on the PASCAL Speech Separation Challenge II (SSC2) corpus. We demonstrate that our system provides large improvements in recognition accuracy compared with a single distant microphone case and the performance of ASR system can be significantly improved both through the use of MMI beamforming and feature mapping approaches.
