2017
DOI: 10.48550/arxiv.1703.02136
Preprint

English Conversational Telephone Speech Recognition by Humans and Machines

Cited by 45 publications (74 citation statements)
References 0 publications
“…To take the English speech recognition task as an example, the Wall Street Journal corpus, which consists of 80 hours of narrated news articles [3], is almost 20 years old and has a word error rate (WER) of 2.32% on its eval92 benchmark [4]. The Switchboard and Fisher corpora, which consist of 262 and 1,698 hours of telephone conversational speech respectively, are also around 20 years old, with a WER of 5.5% on the Switchboard portion of the Hub5'00 benchmark [5]. Even LibriSpeech [6], one of the most popular corpora for speech recognition tasks, is more than 5 years old and has a WER of 1.9% on its test-clean benchmark [7].…”
Section: Introduction (mentioning)
confidence: 99%
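The WER figures quoted above (2.32%, 5.5%, 1.9%) are the fraction of reference words that must be substituted, deleted, or inserted to match the hypothesis transcript, computed via word-level edit distance. A minimal sketch in Python (the function name `wer` is illustrative, not from any cited system):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives 25% WER:
print(wer("the cat sat down", "the hat sat down"))  # 0.25
```

Benchmark scoring pipelines (e.g. NIST's sclite, used for Hub5'00) additionally normalize text and handle alternatives, but the core metric is this ratio.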
“…Research on Automatic Speech Recognition (ASR) has attracted a lot of attention in recent years (Chiu et al., 2018; Watanabe et al., 2018). This success has brought remarkable improvements, reaching human-level performance (Xiong et al., 2016; Saon et al., 2017). It has been achieved through the development of large spoken corpora: supervised (Panayotov et al., 2015; Ardila et al., 2019); semi-supervised (Bell et al., 2015; Ali et al., 2016); and, more recently, unsupervised (Valk and Alumäe, 2020) transcription.…”
Section: Introduction (mentioning)
confidence: 99%
“…Speech recognition systems have been around for more than five decades, with the latest systems achieving Word Error Rates (WER) of 5.5% [1], [2], owing to the advent of deep learning. Due to data security and privacy concerns in cloud-based ASR systems, a clear shift in preference towards on-device deployment of state-of-the-art Automated Speech Recognition (ASR) models is emerging [3].…”
Section: Introduction (mentioning)
confidence: 99%