The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System

Stolcke, Andreas; Anguera, Xavier; Boakye, Kofi; Çetin, Özgür; Janin, Adam; Magimai-Doss, Mathew; Wooters, Chuck; Zheng, Jing

doi:10.1007/978-3-540-68585-2_42

Cited by 47 publications

(28 citation statements)

References 16 publications

“…We also observe the expected results, which have also been earlier observed in the literature [23] [5], that model level adaptation improves performance.…”

Section: Experiments and Results (supporting)

confidence: 90%

“…In practice, it is common for meeting ASR that a well trained acoustic model is first obtained using clean speech data (conversational telephone speech, broadcast news), which is then adapted by using the meeting speech both from close talking microphone (nearfield) as well as distant microphone speech after enhancing the speech by delay-sum beamforming [5] or superdirective beamforming [7]. This approach has been shown to perform well.…”

Section: Introduction (mentioning)

confidence: 99%

A Neural Network Based Regression Approach for Recognizing Simultaneous Speech

Liu

Kumatani

Dines

et al.

Machine Learning for Multimodal Interaction

Self Cite

Abstract. This paper presents our approach for automatic speech recognition (ASR) of overlapping speech. Our system consists of two principal components: a speech separation component and a feature estimation component. In the speech separation phase, we first estimate the speaker's position, and then the speaker location information is used in a GSC-configured beamformer with a minimum mutual information (MMI) criterion, followed by a Zelinski and binary-masking postfilter, to separate the speech of different speakers. In the feature estimation phase, neural networks are trained to learn the mapping from the features extracted from the pre-separated speech to those extracted from the close-talking microphone speech signal. The outputs of the neural networks are then used to generate acoustic features, which are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through ASR experiments on the PASCAL Speech Separation Challenge II (SSC2) corpus. We demonstrate that our system provides large improvements in recognition accuracy compared with a single distant microphone case, and that the performance of the ASR system can be significantly improved through the use of both MMI beamforming and feature mapping approaches.

“…We used SRI's Decipher (Stolcke et al, 2008) to produce word confusion networks for our 17 meeting sub-corpus and then ran our detectors on the WCNs' best path. Table 6 shows a comparison of F-scores.…”

Section: Robustness To ASR Output (mentioning)

confidence: 99%

Modelling and detecting decisions in multi-party dialogue

Fernández

Frampton

Ehlen

et al. 2008

Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue - SIGdial '08

We describe a process for automatically detecting decision-making sub-dialogues in transcripts of multi-party, human-human meetings. Extending our previous work on action item identification, we propose a structured approach that takes into account the different roles utterances play in the decision-making process. We show that this structured approach outperforms the accuracy achieved by existing decision detection systems based on flat annotations, while enabling the extraction of more fine-grained information that can be used for summarization and reporting.

“…These advanced techniques take into account the estimated noise or interfering signal characteristics for superior noise suppression capability [43,44]. In the context of ASR, beamforming techniques have been successfully exploited in the ICSI/SRI [45] and AMIDA [46] systems for transcriptions of meetings [47]. Another research efforts have explored unified multichannel-based speech recognition such as LIMABEAM and multi-channel-based neural networks speech recognizer.…”

Section: Multi-channel Integration In Acoustic Modeling (mentioning)

confidence: 99%

Feature mapping using far-field microphones for distant speech recognition

Himawan

Motlíček

Sridharan

2016

Speech Communication

Acoustic modeling based on deep architectures has recently gained remarkable success, with substantial improvement of speech recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, the multi-channel deep neural network based approaches rely on the powerful modeling capability of the deep neural network (DNN) to learn a suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as an input to the DNN. This allows integrating the multi-channel audio for acoustic modeling without any pre-processing steps. Despite the powerful modeling capabilities of DNNs, an environmental mismatch due to noise and reverberation may result in severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce a nonlinear bottleneck feature mapping approach using a DNN to transform the noisy and reverberant features to their clean version. The bottleneck features trained on the clean signal are used as a teacher signal because they contain information relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques along with the feature mapping are examined for distant large vocabulary speech recognition, using single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that the feature mapping, used in combination with beamforming and speaker adaptation, yields a distant speech recognition performance below 50% word error rate (WER), using a DNN for acoustic modeling.
