Improved tonal language speech recognition by integrating spectro-temporal evidence and pitch information with properly chosen tonal acoustic units

Li, Shang-Wen; Wang, Yow-Bang; Sun, Liang-Che; Lee, Lin-Shan

doi:10.21437/interspeech.2011-609

Cited by 7 publications

(2 citation statements)

References 0 publications


“…Each feature dimension was Z-normalized per speaker. One additional experiment was performed with Model 1, in which its input feature vector was augmented by a fundamental frequency measurement (F0), because F0 has been shown to reduce ASR error rates for tonal languages [33,34]. F0 was extracted from the same 25ms windowed frame, converted from Hertz to Mel scale, then appended to the 40-dimensional log Mel features.…”

Section: Experimental Methods (mentioning)

confidence: 99%
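The feature augmentation described in the citation statement above (mel-scaled F0 appended to 40-dimensional log Mel frames, with per-speaker Z-normalization) might look roughly like the following sketch. The statement does not say which Hz-to-mel formula was used, how unvoiced frames were handled, or whether the F0 dimension was itself normalized, so the formula (the common 2595·log10(1 + f/700) variant), the use of 0 Hz for unvoiced frames, and the order of operations are all assumptions. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def hz_to_mel(f_hz: float) -> float:
    # Assumption: the widely used HTK-style mel formula; the citing
    # paper does not state which variant it applied.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def append_f0(log_mel_frames, f0_hz_per_frame):
    # Append one mel-scaled F0 value to each frame (40 dims -> 41 dims).
    # Assumption: unvoiced frames carry F0 = 0 Hz, which maps to 0 mel.
    return [frame + [hz_to_mel(f0)]
            for frame, f0 in zip(log_mel_frames, f0_hz_per_frame)]


def z_normalize(frames):
    # Z-normalize each feature dimension over all frames of ONE speaker.
    dims = list(zip(*frames))
    mus = [mean(d) for d in dims]
    # Guard against constant dimensions (standard deviation of zero).
    sigmas = [pstdev(d) or 1.0 for d in dims]
    return [[(x - m) / s for x, m, s in zip(frame, mus, sigmas)]
            for frame in frames]


# Toy usage: two 40-dim frames from a single speaker.
frames = [[0.0] * 40, [1.0] * 40]
augmented = append_f0(frames, [0.0, 700.0])
normalized = z_normalize(augmented)
```

Here normalization is applied after the F0 column is appended, so F0 is normalized along with the spectral dimensions; swapping the two calls would leave F0 on its raw mel scale.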

Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

Li, Hasegawa-Johnson. 2020. Preprint.


Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; Tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve lower error rate in the joint phone+tone tier, but asynchronous training results in lower tone error rate.


“…Some papers have suggested extracting alternative features from the input signal. In [14], Li et al convert the input signal to a spectrogram and convolve it with a set of Gabor filters to yield a set of feature maps. Frame-level tone labels are obtained using forced alignment, and an MLP is trained to predict the tone label for each frame using the feature maps.…”

Section: Existing Approaches (mentioning)

confidence: 99%
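The first step named in the citation statement above, convolving a spectrogram with a set of Gabor filters to obtain feature maps, can be illustrated with a small stdlib-only sketch. The kernel size, wavelengths, orientations, and Gaussian width below are illustrative assumptions, not the settings of the cited paper, and the function names are hypothetical. The forced alignment and MLP stages are not shown.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    # Real (cosine-phase) Gabor kernel with an isotropic Gaussian envelope.
    # `size` must be odd so the kernel has a well-defined centre.
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction of the sinusoidal carrier.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter2d_valid(image, kernel):
    # "Valid"-mode sliding-window filtering. This kernel is symmetric under
    # a 180-degree rotation, so correlation and convolution coincide here.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def gabor_feature_maps(spectrogram, wavelengths, thetas, size=5, sigma=2.0):
    # One feature map per (wavelength, orientation) pair.
    return [filter2d_valid(spectrogram, gabor_kernel(size, w, t, sigma))
            for w in wavelengths for t in thetas]


# Toy usage: an 8 x 10 "spectrogram" (frequency bins x frames) of ones.
toy_spectrogram = [[1.0] * 10 for _ in range(8)]
maps = gabor_feature_maps(toy_spectrogram, [4.0, 8.0], [0.0, math.pi / 2])
```

Varying the orientation `theta` lets individual filters respond to rising or falling spectro-temporal patterns, which is one plausible reason such filters suit tone cues.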

Tone Recognition Using Lifters and CTC

Lugosch, Tomar. 2018. Interspeech 2018.


In this paper, we present a new method for recognizing tones in continuous speech for tonal languages. The method works by converting the speech signal to a cepstrogram, extracting a sequence of cepstral features using a convolutional neural network, and predicting the underlying sequence of tones using a connectionist temporal classification (CTC) network. The performance of the proposed method is evaluated on a freely available Mandarin Chinese speech corpus, AISHELL-1, and is shown to outperform the existing techniques in the literature in terms of tone error rate (TER).
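The abstract's first stage, converting the speech signal to a cepstrogram, can be sketched as a sequence of per-frame real cepstra (the inverse DFT of the log magnitude spectrum). This is a minimal illustration under assumptions: the frame length, hop size, absence of a window function, and the naive O(N²) DFT are choices made here for self-containment and are not taken from the paper. The CNN and CTC stages are not shown.

```python
import cmath
import math


def dft(x):
    # Naive discrete Fourier transform, adequate for short toy frames.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, eps=1e-10):
    # Real cepstrum: inverse DFT of the log magnitude spectrum.
    # `eps` avoids log(0) for bins with zero magnitude.
    n = len(frame)
    log_mag = [math.log(abs(c) + eps) for c in dft(frame)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def cepstrogram(signal, frame_len, hop):
    # Slice the signal into overlapping frames and stack their cepstra.
    # Assumption: no window function is applied, to keep the sketch short.
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [real_cepstrum(signal[s:s + frame_len]) for s in starts]


# Toy usage: a 64-sample sinusoid, 32-sample frames, 16-sample hop.
toy_signal = [math.sin(2.0 * math.pi * 4.0 * t / 32.0) for t in range(64)]
ceps = cepstrogram(toy_signal, frame_len=32, hop=16)
```

In a cepstrum of voiced speech, periodic excitation tends to show up as a peak at the quefrency matching the pitch period, which is why a cepstral representation is a natural input for tone recognition.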
