Mel-frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) coefficients are two popular representations of continuous speech in existing Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) systems. Cepstral Mean Normalization (CMN) is often used as a post-processing step in the extraction of MFCC and PLP features to further enhance noise robustness at almost negligible computational cost. In this paper, we build a closed-dictionary, large-vocabulary, HMM-based Indonesian-language ASR system using the CMU Sphinx III speech recognition toolkit, implementing MFCC and PLP feature extraction together with CMN. We test the effect of various types and levels of noise on the word error rate (WER) of speech recognition. With CMN, an average improvement of 2% in recognition over standard MFCC and PLP extraction is obtained at signal-to-noise ratios (SNR) below 24 decibels (dB). A significant drop in recognition is observed between 12 and 6 dB SNR.
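As an illustration of why CMN is nearly free computationally, the following minimal sketch subtracts the per-coefficient mean over all frames of one utterance. It is not taken from the paper or from the Sphinx toolkit: the function name `cepstral_mean_normalize` and the toy feature values are assumptions made for illustration, and per-utterance (rather than running) mean estimation is assumed.

```python
from typing import List, Sequence


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean computed over all frames.

    `frames` is a list of cepstral vectors (one per frame) for a single
    utterance. Hypothetical helper, not a Sphinx API.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each cepstral coefficient across the whole utterance.
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    # Remove the mean from every frame, coefficient by coefficient.
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]


# Toy input: two frames with two coefficients each (made-up values).
features = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(features)
print(normalized)  # [[-1.0, -2.0], [1.0, 2.0]]
```

The cost is one pass to compute the means and one pass to subtract them, which is consistent with the negligible overhead noted above. A stationary convolutional distortion adds an approximately constant offset in the cepstral domain, so subtracting the mean removes much of it.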
Online speaker diarization and identification is the process of determining 'who spoke when' in an ongoing conversation or audio stream, in contrast to the offline scenario, where the conversation has concluded and the entire file is available. Online identification is required when speaker identities need to be determined during or directly after speech, for instance in the automatic transcription of live broadcasts and of some meetings. The process of constructing an Indonesian-language online speaker identification system is explored, from design and corpus development to experimentation. The system conducts speaker identification directly on low-energy separated segments and employs a rolling window of time-weighted average likelihoods to improve accuracy, resulting in a system with a latency of one speaker segment for predictions. Experimentation against a standard baseline offline system resulted in speaker error rates (SER) of 25.5% for the proposed online system and 18.5% for the baseline offline system. The latency of the proposed system is 0.21 times the length of input segments, compared to 1.10 for the baseline system.
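One possible reading of the rolling window of time-weighted average likelihoods is sketched below. The abstract does not specify the weighting, so this sketch assumes each segment's per-speaker log-likelihoods are weighted by segment duration and averaged over the most recent segments. The class name `RollingSpeakerScorer`, the window size, and all scores are hypothetical, and every segment is assumed to be scored against the same set of speakers.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RollingSpeakerScorer:
    """Duration-weighted average of per-speaker scores over recent segments.

    Hypothetical sketch; the weighting scheme is an assumption, not the
    method described in the paper.
    """

    def __init__(self, window_size: int = 3) -> None:
        # Each entry is (segment duration in seconds, per-speaker log-likelihoods).
        self.window: Deque[Tuple[float, Dict[str, float]]] = deque(maxlen=window_size)

    def update(self, duration: float, log_likelihoods: Dict[str, float]) -> str:
        """Add one segment and return the speaker with the best averaged score."""
        self.window.append((duration, log_likelihoods))
        total_duration = sum(d for d, _ in self.window)
        averaged = {
            speaker: sum(d * scores[speaker] for d, scores in self.window) / total_duration
            for speaker in log_likelihoods
        }
        return max(averaged, key=lambda speaker: averaged[speaker])


scorer = RollingSpeakerScorer(window_size=2)
first = scorer.update(2.0, {"A": -1.0, "B": -3.0})
# A short segment that favors "B" on its own is outweighed by the longer one.
second = scorer.update(0.5, {"A": -2.5, "B": -2.0})
print(first, second)  # A A
```

Because a prediction is available as soon as each segment has been scored, a scheme of this kind has a latency of one segment, in line with the latency described above. The trade-off is that a genuine speaker change on a short segment may be smoothed away by longer neighbouring segments.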