The MERL/SRI system for the 3RD CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition

Hori, Takaaki; Chen, Zhuo; Erdoğan, Hakan; Hershey, John R.; Le Roux, Jonathan; Mitra, Vikramjit; Watanabe, Shinji

doi:10.1109/asru.2015.7404833

Cited by 35 publications

(45 citation statements)

References 16 publications


“…Concerning other parts of the decoder, Hori et al (2015) reported consistent improvements on real and simulated data by replacing the default 3-gram language model used in the baseline by a 5-gram language model with Kneser-Ney (KN) smoothing (Kneser and Ney, 1995), rescoring the lattice using a recurrent neural network language model (RNN-LM) (Mikolov et al, 2010), and fusing the outputs of multiple systems using MBR. This claim also holds true for system combination based on recognizer output voting error reduction (ROVER) (Fiscus, 1997), as reported by Fujita et al (2015).…”

Section: Language Modeling and ROVER Fusion (mentioning)

confidence: 99%

“…In the following, we do not discuss DNN post-filters, which provided a limited improvement or degradation on both real and simulated data (Hori et al, 2015; Sivasankaran et al, 2015), and we focus on multichannel DNN-based enhancement instead. Table 5 illustrates the performance of the DNN-based time-invariant generalized eigenvalue (GEV) beamformer proposed by Heymann et al (2015).…”

Section: DNN-based Beamforming and Separation (mentioning)

confidence: 99%

“…Each training setup resulted in a different enhancement system. The resulting ASR performance was evaluated using the updated DNN-based baseline distributed by the organizers after the challenge (Hori et al, 2015). This baseline is identical to the one described in Section 3.1, except that decoding is performed using a 5-gram language model with KN smoothing and RNN-LM based rescoring.…”

Section: Multichannel DNN-based Separation (mentioning)

confidence: 99%

“…Other enhancement techniques which result in similar characteristics for enhanced real and simulated signals do not appear to suffer from this problem. Table 7: WER (%) achieved after enhancement by BeamformIt using various feature extraction and normalization methods and the DNN backend retrained on enhanced real and simulated data without sMBR (Hori et al, 2015). Tachioka et al (2015) concatenated logmel or MFCC features with 40-dimensional bottleneck (BN) features extracted as the neuron outputs in the smaller hidden layer of a neural network with two hidden layers trained to predict phoneme posteriors. The neural network was trained on real and simulated data with logmel and pitch features as inputs.…”

Section: Robust Features and Feature Normalization (mentioning)

confidence: 99%


An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Vincent, Watanabe, Nugraha, et al. 2017. Computer Speech & Language. (Self cite)

“…Heymann et al (2015) employ a DNN to perform the necessary speech and noise covariance estimates. Other teams have employed a conventional delay and sum beamformer (e.g., Sivasankaran et al, 2015; Hori et al, 2015; Prudnikov et al, 2015). Of these, several reported that the freely available BeamformIt tool developed by Anguera et al (2007) worked very effectively.…”

Section: Target Enhancement (mentioning)

confidence: 99%

The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes. Barker, Marxer, Vincent, et al. 2017. Computer Speech & Language. (Self cite)

This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline resulting in reductions in word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech directly recorded in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating various estimated signal properties with typical system performance on a per session and per utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.


Application of Source Separation to Robust Speech Analysis and Recognition. Watanabe, Virtanen, Kolossa. 2018. Audio Source Separation and Speech Enhancement.
