2017
DOI: 10.48550/arxiv.1703.02136
Preprint

English Conversational Telephone Speech Recognition by Humans and Machines

Cited by 45 publications (74 citation statements)
References 0 publications
“…To take the English speech recognition task as an example, the Wall Street Journal corpus, which consists of 80 hours of narrated news articles [3], is almost 20 years old and has a word error rate (WER) of 2.32% on its eval92 benchmark [4]. The Switchboard and Fisher corpora, which consist of 262 and 1,698 hours of telephone conversational speech respectively, are also around 20 years old, with a WER of 5.5% on the Switchboard portion of the Hub5'00 benchmark [5]. Even LibriSpeech [6], one of the most popular corpora for speech recognition tasks, is more than 5 years old and has a WER of 1.9% on its test-clean benchmark [7].…”
Section: Introduction (mentioning)
confidence: 99%
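The WER figures quoted above (2.32%, 5.5%, 1.9%) are the fraction of reference words that must be substituted, deleted, or inserted to match the hypothesis transcript, computed via word-level edit distance. A minimal sketch in Python (the function name `wer` is illustrative, not from any cited system):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives 25% WER:
print(wer("the cat sat down", "the hat sat down"))  # 0.25
```

Benchmark scoring pipelines (e.g. NIST's sclite, used for Hub5'00) additionally normalize text and handle alternatives, but the core metric is this ratio.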
“…Research on Automatic Speech Recognition (ASR) has attracted a lot of attention in recent years (Chiu et al., 2018; Watanabe et al., 2018). This success has brought remarkable improvements, reaching human-level performance (Xiong et al., 2016; Saon et al., 2017). It has been achieved through the development of large spoken corpora: supervised (Panayotov et al., 2015; Ardila et al., 2019); semi-supervised (Bell et al., 2015; Ali et al., 2016); and, more recently, unsupervised (Valk and Alumäe, 2020) transcription.…”
Section: Introduction (mentioning)
confidence: 99%
“…Speech recognition systems have been around for more than five decades, with the latest systems achieving Word Error Rates (WER) of 5.5% [1], [2], owing to the advent of deep learning. Due to data security and privacy concerns in cloud-based ASR systems, a clear shift in preference towards on-device deployment of state-of-the-art Automated Speech Recognition (ASR) models is emerging [3].…”
Section: Introduction (mentioning)
confidence: 99%