A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A fifth-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic speech recognition.
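To make the pipeline concrete, the sketch below walks one pre-windowed speech frame through the three psychoacoustic steps and the all-pole fit in Python (numpy/scipy). The Bark warping and equal-loudness curve follow the closed forms given in the paper, but the triangular band weighting, FFT size, band count, and the plp_sketch name are illustrative assumptions, not Hermansky's exact implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def plp_sketch(frame, fs=8000, n_fft=256, n_bands=17, order=5):
    """Illustrative PLP analysis of a single windowed speech frame."""
    # Short-term power spectrum.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # (1) Critical-band spectral resolution: pool FFT bins into bands
    # spaced on the Bark scale (crude 1-Bark triangles stand in for the
    # paper's asymmetric masking curve).
    bark = 6.0 * np.arcsinh(freqs / 600.0)              # Hz -> Bark
    centers = np.linspace(bark[1], bark[-1], n_bands)
    weights = np.clip(1.0 - np.abs(bark[None, :] - centers[:, None]), 0.0, 1.0)
    bands = weights @ power

    # (2) Equal-loudness pre-emphasis, the E(omega) curve from the paper.
    w2 = (2.0 * np.pi * 600.0 * np.sinh(centers / 6.0)) ** 2
    bands *= (w2 ** 2 * (w2 + 56.8e6)) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

    # (3) Intensity-loudness power law: cube-root amplitude compression.
    loud = bands ** 0.33

    # All-pole model of the auditory spectrum: an inverse DFT of the
    # even-symmetric spectrum gives autocorrelations, and the
    # Yule-Walker equations yield a low-order autoregressive fit.
    sym = np.concatenate([loud, loud[-2:0:-1]])
    r = np.fft.ifft(sym).real[: order + 1]
    a = solve_toeplitz(r[:order], -r[1 : order + 1])
    return np.concatenate([[1.0], a])                   # AR coefficients
```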
Hidden Markov model (HMM) speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively trained neural networks to estimate the posterior probabilities of subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-network discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate subword posterior probabilities, then using transformations of these estimates as the base features for a conventionally trained Gaussian-mixture-based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task.
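As an illustration of this tandem idea, the sketch below turns hypothetical network posteriors into base features for a Gaussian-mixture system: a log compression undoes the softmax skew, and a PCA rotation decorrelates the dimensions so that diagonal-covariance mixtures can model them. The function name, the eps floor, and the retained dimensionality are assumptions, not values from the paper.

```python
import numpy as np

def tandem_features(posteriors, n_keep=24, eps=1e-10):
    """Map (n_frames, n_units) softmax posteriors to decorrelated
    log-posterior features for a conventional GMM-HMM system."""
    x = np.log(posteriors + eps)                  # log posteriors
    x -= x.mean(axis=0)                           # per-dimension mean removal
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    top = np.argsort(eigval)[::-1][:n_keep]       # leading PCA directions
    return x @ eigvec[:, top]                     # rotated base features
```

In a tandem setup, these rotated features would then replace, or be appended to, conventional cepstral features at the input of the Gaussian-mixture acoustic model.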
We propose a new technique for training deep neural networks (DNNs) as data-driven feature front-ends for large-vocabulary continuous speech recognition (LVCSR) in low-resource settings. To circumvent the lack of sufficient training data for acoustic modeling in these scenarios, we use transcribed multilingual data and semi-supervised training to build the proposed feature front-ends. In our experiments, the proposed features provide an absolute improvement of 16% in a low-resource LVCSR setting with only one hour of in-domain training data. While close to three-fourths of these gains come from DNN-based features, the rest come from semi-supervised training.
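As a rough sketch of what such a DNN feature front-end looks like, the class below exposes the activations of a narrow bottleneck hidden layer as features for a downstream acoustic model. The layer sizes, ReLU nonlinearity, and random weights are placeholders; in the setting described above, the weights would instead come from supervised training on transcribed multilingual data followed by semi-supervised training on automatically labeled in-domain audio.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Placeholder random weights; a real front-end would load weights
    learned via multilingual and semi-supervised training."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

class BottleneckDNN:
    """Minimal DNN feature front-end with a narrow bottleneck layer."""
    def __init__(self, n_in=440, hidden=1024, bottleneck=40, n_targets=3000):
        sizes = [n_in, hidden, hidden, bottleneck, hidden, n_targets]
        self.layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
        self.n_bottleneck_layers = 3  # layers up to and including the bottleneck

    def features(self, x):
        """Forward pass up to the bottleneck; x is (n_frames, n_in)."""
        h = x
        for w, b in self.layers[: self.n_bottleneck_layers]:
            h = np.maximum(h @ w + b, 0.0)        # ReLU hidden layers
        return h                                   # (n_frames, bottleneck)
```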
Typical supervised acoustic model training relies on strong top-down constraints provided by dynamic programming alignment of the input observations to phonetic sequences derived from orthographic word transcripts and pronunciation dictionaries. This paper investigates a much weaker form of top-down supervision for use in place of transcripts and dictionaries in the zero-resource setting. Our proposed constraints, which can be produced using recent spoken term discovery systems, come in the form of pairs of isolated word examples that share the same unknown type. For each pair, we perform a dynamic programming alignment of the acoustic observations of the two constituent examples, generating an inventory of cross-speaker frame pairs, each of which provides evidence that the same subword unit model should account for both frames. We find that these weak top-down constraints can improve model speaker independence by up to 57% relative to bottom-up training alone.
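The sketch below shows how such cross-example frame pairs could be harvested from one pair of same-word examples: a standard dynamic time warping alignment over frame-level distances, followed by a backtrace that emits the aligned frame indices. The cosine distance and the unconstrained warping path are illustrative assumptions, not necessarily the paper's choices.

```python
import numpy as np

def dtw_frame_pairs(x, y):
    """Align two same-word examples, each (n_frames, n_features), and
    return aligned frame-index pairs: the weak top-down constraint that
    one subword unit model should account for both frames in a pair."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    dist = 1.0 - xn @ yn.T                       # cosine frame distances
    nx, ny = dist.shape
    cost = np.full((nx + 1, ny + 1), np.inf)     # accumulated alignment cost
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    pairs, i, j = [], nx, ny                     # backtrace the best path
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]                           # cross-speaker frame pairs
```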