Interspeech 2017
DOI: 10.21437/interspeech.2017-405
English Conversational Telephone Speech Recognition by Humans and Machines

Abstract: One of the most difficult speech recognition tasks is accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far d…
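The figures quoted in the abstract (14% down to 5.8%) are word error rates: the number of word substitutions, deletions, and insertions needed to turn the system's hypothesis into the reference transcript, divided by the reference length. A minimal illustrative sketch of that computation via word-level Levenshtein distance follows; the actual Switchboard evaluations use the NIST scoring tools rather than this toy function.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 words
```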

Cited by 284 publications (239 citation statements); references 31 publications.
“…However, while the raw technical performance of contemporary spoken language systems has improved significantly in recent years [as evidenced by corporate giants such as Microsoft and IBM continuing to issue claim and counter-claim as to whose system has the lowest word error rates (Xiong et al., 2016; Saon et al., 2017)], in reality, users' experiences with such systems are often less than satisfactory. Not only can real-world conditions (such as noisy environments, strong accents, older/younger users or non-native speakers) lead to very poor speech recognition accuracy, but the 'understanding' exhibited by contemporary systems is rather shallow.…”
Section: Limitations of Current Systems
confidence: 99%
“…Recently, the development of deep learning technologies has led to great progress in the field of automatic speech recognition (ASR). Current state-of-the-art ASR systems are approaching human recognition performance levels [1,2] when speech is recorded with a close-talking microphone. However, recognition of speech recorded by distant microphones remains challenging because of acoustic interference such as noise, reverberation and interfering speakers.…”
Section: Introduction
confidence: 99%
“…Adversarial domain adaptation is suitable for the situation where no transcriptions or parallel adaptation data are available in either domain. It can also effectively suppress environment [12,13,14] and speaker [15,16] variability during domain adaptation. However, in the speech area, a parallel sequence of target-domain data can easily be simulated from the source-domain data such that the speech from both domains is frame-by-frame synchronized.…”
Section: Introduction
confidence: 99%