ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683664
A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-trained Neural Network Acoustic Models

Abstract: In this work, three lattice-free (LF) discriminative training criteria for purely sequence-trained neural network acoustic models are compared on LVCSR tasks: maximum mutual information (MMI), boosted maximum mutual information (bMMI), and state-level minimum Bayes risk (sMBR). We demonstrate that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using the LF-bMMI or LF-sMBR criterion without the need for cross-entropy pre-training. Furthermore, experimental re…
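For orientation, here is a minimal sketch of the three objectives in generic notation (ours, not necessarily the paper's): O_u is the acoustics of utterance u, W_u its reference transcript, kappa the acoustic scale, b the boosting factor, and A(W', W_u) a state-level accuracy of hypothesis W' against the reference.

    \begin{align}
    \mathcal{F}_{\mathrm{MMI}}  &= \sum_{u} \log
      \frac{p(\mathbf{O}_u \mid W_u)^{\kappa}\, P(W_u)}
           {\sum_{W'} p(\mathbf{O}_u \mid W')^{\kappa}\, P(W')} \\
    \mathcal{F}_{\mathrm{bMMI}} &= \sum_{u} \log
      \frac{p(\mathbf{O}_u \mid W_u)^{\kappa}\, P(W_u)}
           {\sum_{W'} p(\mathbf{O}_u \mid W')^{\kappa}\, P(W')\, e^{-b\, A(W', W_u)}} \\
    \mathcal{F}_{\mathrm{sMBR}} &= \sum_{u} \sum_{W'} P(W' \mid \mathbf{O}_u)\, A(W', W_u)
    \end{align}

In the lattice-free setting, the denominator sums are evaluated exactly by forward-backward recursion over a phone-level LM graph rather than over word lattices.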

Cited by 6 publications (4 citation statements) | References 20 publications
“…Third, the initialization of speaker embeddings should be explored with more advanced speaker diarization techniques [10,37,38]. Finally, advanced ASR techniques, such as data augmentation [39-42], model ensembles [43-45], and improved training criteria [46,47], will also improve overall performance. We will explore these directions for future work.…”
Section: Discussion
confidence: 99%
“…The LF-MMI [5] criterion was extended to include boosting [22] in [23,24]. Here, we present it again in the generalized hybrid-model framework for different modeling units and label topologies.…”
Section: LF-bMMI Training
confidence: 99%
“…Implementation-wise, however, in the lattice-free training framework it is easiest to define this as a sum of per-frame accuracy values. Therefore, as in [24], we use the numerator posterior derived from the numerator graph as a proxy for the per-frame state-level accuracy values. The intuition behind boosted MMI can also be interpreted through max-margin learning [25,26].…”
Section: LF-bMMI Training
confidence: 99%
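A minimal sketch of how that proxy can enter the computation, assuming NumPy-like [T, S] arrays and a hypothetical den_forward_backward routine that runs the forward-backward recursion over the denominator graph (all names here are illustrative, not from the paper or any toolkit):

    def lf_bmmi_grad(log_likes, num_post, den_forward_backward, b=0.5):
        # log_likes, num_post: [T, S] per-frame state log-likelihoods and
        # numerator-graph occupancies for one utterance.
        # Because the boosting term e^{-b A} decomposes over frames, it can
        # be folded into the per-frame denominator scores, with numerator
        # posteriors standing in for per-frame state-level accuracies.
        boosted = log_likes - b * num_post
        den_post = den_forward_backward(boosted)  # [T, S] denominator occupancies
        # Gradient of the objective w.r.t. the log-likelihoods.
        return num_post - den_post

Setting b = 0 recovers plain LF-MMI, matching the relationship between the two criteria.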
“…Additionally, the language model (LM) has been simplified by limiting the LM context to a phone-level four-gram, which allows for more frequent recombination of state paths. This approach has been adopted and adapted to different sequence criteria [12,13], to score fusion and system combination [14,15], and to settings where we completely dispense with the need for initial GMM models [16].…”
Section: Introduction
confidence: 99%
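As a toy illustration of that recombination (the lm_logprob function and the bookkeeping below are hypothetical, for intuition only): with a four-gram phone LM, a path's continuation depends only on its last three phones, so hypotheses that agree on that truncated history can be merged, keeping the number of live states bounded regardless of utterance length.

    import math
    from collections import defaultdict

    def logadd(a, b):
        if a == float("-inf"):
            return b
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    def extend(hyps, phones, lm_logprob):
        # hyps maps a truncated (up to 3-phone) history tuple to the
        # log-probability mass of all paths ending in that history.
        new = defaultdict(lambda: float("-inf"))
        for hist, logp in hyps.items():
            for ph in phones:
                nxt = (hist + (ph,))[-3:]  # four-gram LM: keep 3 phones
                # Paths reaching the same truncated history recombine here.
                new[nxt] = logadd(new[nxt], logp + lm_logprob(hist, ph))
        return dict(new)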