The IBM 2016 English Conversational Telephone Speech Recognition System

Saon, George; Sercu, Tom; Rennie, Steven J.; Kuo, Hong-Kwang Jeff

doi:10.21437/interspeech.2016-1460

Cited by 105 publications

(110 citation statements)

References 24 publications

(47 reference statements)

Mentioning: 104

“…Finally, novel acoustic models, especially the deep models, require long training and experimental turnaround time. While most research groups in industry [53,23,118,42,119,120,121] have the computational resource and large amount of training data,…”

Section: A1 Motivation (mentioning)

confidence: 99%

Graph-Based Semisupervised Learning for Acoustic Modeling in Automatic Speech Recognition

Liu, Kirchhoff. 2016. IEEE/ACM Trans. Audio Speech Lang. Process.


Graph-based semi-supervised learning (SSL) is a widely used semi-supervised learning method in which the labeled data and unlabeled data are jointly represented as a weighted graph, and the information is propagated from the labeled data to the unlabeled data. The key assumption that graph-based SSL makes is that data samples lie on a low dimensional manifold, where samples that are close to each other are expected to have the same class label. More importantly, by exploiting the relationship between training and test samples, graph-based SSL implicitly adapts to the test data. In this thesis, we address several key challenges in applying graph-based SSL to acoustic modeling. We first investigate and compare several state-of-the-art graph-based SSL algorithms on a benchmark dataset. In addition, we propose novel graph construction methods that allow graph-based SSL to handle variable-length input features. We next investigate the efficacy of graph-based SSL in the context of a fully-fledged DNN-based ASR system. We compare two different integration frameworks for graph-based learning. First, we propose a lattice-based late integration framework that combines graph-based SSL with the DNN-based acoustic modeling and evaluate the framework on continuous word recognition tasks. Second, we propose an early integration framework using neural graph embeddings and compare two different neural graph embedding features that capture the information of the manifold at different levels. The embedding features are used as input to a DNN system and are shown to outperform the conventional acoustic feature inputs on several medium-to-large vocabulary conversational speech recognition tasks.



“…To achieve this, several previous studies train a speaker independent DNN using many speech samples spoken by many speakers [3][4][5][6][7][8][9][10][11][12][13][14]. Meanwhile, in other speech applications, model specialization to the target speaker has succeeded [15,16]. In text-to-speech synthesis (TTS), the target speaker model is trained using samples spoken by a target speaker, and that has achieved high performance [15].…”

Section: Introduction (mentioning)

confidence: 99%

Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

Koizumi, Yatabe, Delcroix, et al. 2020. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and synthesis, it is known that model adaptation to the target speaker improves the accuracy. Our research question is whether a DNN for speech enhancement can be adapted to unknown speakers without any auxiliary guidance signal in the test phase. To achieve this, we adopt multi-task learning of speech enhancement and speaker identification, and use the output of the final hidden layer of the speaker identification branch as an auxiliary feature. In addition, we use multi-head self-attention for capturing long-term dependencies in the speech and noise. Experimental results on a public dataset show that our strategy achieves state-of-the-art performance and also outperforms conventional methods in terms of subjective quality.


“…Deep learning has significantly advanced the state-of-the-art in speech recognition over the past few years [1]-[3]. Most speech recognisers now employ the neural network and hidden Markov model (NN/HMM) hybrid architecture, first investigated in the early 1990s [4], [5].…”

Section: Introduction (mentioning)

confidence: 99%

Small-Footprint Highway Deep Neural Networks for Speech Recognition

Renals. 2017. IEEE/ACM Trans. Audio Speech Lang. Process.


State-of-the-art speech recognition systems typically employ neural network acoustic models. However, compared to Gaussian mixture models, deep neural network (DNN) based acoustic models often have many more model parameters, making it challenging for them to be deployed on resource-constrained platforms, such as mobile devices. In this paper, we study the application of the recently proposed highway deep neural network (HDNN) for training small-footprint acoustic models. HDNNs are depth-gated feedforward neural networks that include two types of gate functions to facilitate the information flow through different layers. Our study demonstrates that HDNNs are more compact than regular DNNs for acoustic modeling, i.e., they can achieve comparable recognition accuracy with many fewer model parameters. Furthermore, HDNNs are more controllable than DNNs: the gate functions of an HDNN can control the behavior of the whole network using a very small number of model parameters. Finally, we show that HDNNs are more adaptable than DNNs. For example, simply updating the gate functions using adaptation data can result in considerable gains in accuracy. We demonstrate these aspects by experiments using the publicly available AMI corpus, which has around 80 hours of training data.
