2018 · Preprint
DOI: 10.48550/arxiv.1801.00059

The CAPIO 2017 Conversational Speech Recognition System

Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim, et al.

Abstract: In this paper we show how we have achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We propose densely connected LSTMs (dense LSTMs), inspired by the densely connected convolutional networks recently introduced for image classification tasks. We show that the proposed dense LSTMs provide more reliable performance than conventional residual LSTMs as more LSTM layers are stacked in the network. We also propose an acous…
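To make the dense-connectivity idea in the abstract concrete, below is a minimal sketch of DenseNet-style skip connections applied to a stack of LSTM layers: each layer consumes the concatenation of the raw input and every earlier layer's output, rather than only the previous layer's output (plain stacking) or a summed shortcut (residual LSTM). The class name, layer sizes, and concatenation scheme are illustrative assumptions, not the CAPIO system's actual configuration.

```python
# Hedged sketch of a densely connected LSTM stack (PyTorch).
# Not the authors' implementation; dimensions are placeholders.
import torch
import torch.nn as nn


class DenseLSTMStack(nn.Module):
    """Layer k receives [input, out_1, ..., out_{k-1}] concatenated along
    the feature dimension, in the spirit of densely connected networks."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        in_size = input_size
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_size, hidden_size, batch_first=True))
            # The next layer sees everything produced so far plus the raw input.
            in_size += hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_size) frame-level acoustic features
        features = [x]
        for lstm in self.layers:
            out, _ = lstm(torch.cat(features, dim=-1))
            features.append(out)
        # Return the last layer's output; in a hybrid acoustic model a
        # projection and softmax over senones would follow.
        return features[-1]


if __name__ == "__main__":
    model = DenseLSTMStack(input_size=40, hidden_size=64, num_layers=4)
    dummy = torch.randn(2, 100, 40)  # 2 utterances, 100 frames, 40-dim features
    print(model(dummy).shape)        # torch.Size([2, 100, 64])
```

Because every layer can read earlier layers' outputs directly, gradients reach the lower layers through short paths, which is the property the abstract credits for more reliable behavior as depth grows.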

Cited by 9 publications (14 citation statements). References 27 publications (54 reference statements).
“…Table 1 compares the TDS model with three other systems. The CAPIO system is a hybrid HMM-DNN with speaker adaptation [33]. The other two are end-to-end models, one using the CRF-style ASG loss [31] and the other a sequence-to-sequence model with an RNN encoder [23].…”
Section: Results (mentioning)
Confidence: 99%
“…The error rates on the SWB and CH subsets decrease from 6.5 and 11.9 to 6.2 and 11.4 (Table 2). Our best model is significantly better than previously published CTC [29] and LSTM-based [3] models, and approaches the heavily tuned hybrid system [28] with dense TDNN-LSTM. It is likely possible to reach better error rates with the help of ensembled models, further data augmentation, and language models.…”
Section: Speech Recognition Results (mentioning)
Confidence: 69%
“…be found in telephony speech or readings of audio books. On standard tasks for this scenario, such as Switchboard and LibriSpeech [1,2], typical WERs are below 10%. Nevertheless, ASR on noisy data remains challenging.…”
Section: Introduction (mentioning)
Confidence: 96%