Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks

Povey, Daniel; Cheng, Gaofeng; Wang, Yiming; Li, Ke; Xu, Hainan; Yarmohammadi, Mahsa; Khudanpur, Sanjeev

doi:10.21437/interspeech.2018-1417

Cited by 382 publications

(251 citation statements)

References 15 publications

Citation types: Supporting, Mentioning (247), Contrasting, Unclassified


“…The mouth ROI of the target speaker is fed into the LipNet to generate the visual features. The RecogNet is a TDNN network with factored time-delay neural network (TDNN-F) [33] components, which has been shown to be effective in modeling long range temporal dependencies [33]. In our experiments, the hybrid TDNN AVSR system trained with LF-MMI criterion demonstrates the state-of-the-art performance on the LRS2 dataset.…”

Section: Audio-visual Speech Recognition (mentioning)

confidence: 83%

Audio-Visual Recognition of Overlapped Speech for the LRS2 Dataset

Zhang

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs i.e. end-to-end and hybrid of AVSR systems are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state-of-the-art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest the proposed AVSR system outperformed the audio only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produced recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.


“…Two types of neural network-based acoustic model architectures were evaluated: (1) the recently proposed TDNN-F models [21], which have been shown to be effective in under-resourced scenarios, and (2) TDNN-F with added convolutional layers (CNN-TDNN-F). It has recently been shown that the locality, weight sharing and pooling properties of the convolutional layers have potential to improve the performance of ASR [26].…”

Section: Acoustic Modelling (mentioning)

confidence: 99%

Semi-Supervised Acoustic Model Training for Five-Lingual Code-Switched ASR

et al. 2019


This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.


“…The initial baseline system [11] of the CHiME-5 challenge uses a Time Delay Neural Network (TDNN) acoustic model (AM). However, recently it has been shown that introducing factorized layers into the TDNN architecture facilitates training deeper networks and also improves the ASR performance [25]. This architecture has been employed in the new baseline system for the challenge [10].…”

Section: Acoustic Model (mentioning)

confidence: 99%

An Investigation into the Effectiveness of Enhancement in ASR Training and Test for Chime-5 Dinner Party Transcription

Zorilă

Boeddeker

Doddipatla

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available. However, there has been a longstanding debate whether enhancement should also be carried out on the ASR training data. In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data we show that: (i) cleaning up the training data can lead to substantial error rate reductions, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training. This approach stands in contrast to, and delivers larger gains than, the common strategy reported in the literature of augmenting the training database with additional artificially degraded speech. Together with an acoustic model topology consisting of initial CNN layers followed by factorized TDNN layers, we achieve 41.6% and 43.2% WER on the DEV and EVAL test sets, respectively, a new single-system state-of-the-art result on the CHiME-5 data. This is an 8% relative improvement compared to the best word error rate published so far for a speech recognizer without system combination.
