The History of Speech Recognition to the Year 2030

Hannun, Awni

doi:10.48550/arxiv.2108.00084

Cited by 5 publications

(4 citation statements)

References 17 publications

“…Thirdly, as SSL implicitly learns a language model and other semantic information through the tasks it is subjected to solve [12], the generalizability of these models is only to the extent where data from a similar language or phonetic structure is introduced to it at finetuning. Thus, as correctly pointed out by [13], SSL for speech suffers from the problems of scale, and SSL generalizability can be improved with more efficient training procedures. Prior work for domain adaptation with self-supervised models mostly employ continued pre-training or combined data pre-training approaches [11].…”

Section: Introduction (mentioning)

confidence: 90%

PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations

Prasad, Ghosh, Umesh

2022

Preprint

While self-supervised speech representation learning (SSL) models serve a variety of downstream tasks, these models have been observed to overfit to the domain from which the unlabelled data originates. To alleviate this issue, we propose PADA (Pruning Assisted Domain Adaptation), and zero out redundant weights from models pre-trained on large amounts of out-of-domain (OOD) data. Intuitively, this helps to make space for the target-domain ASR finetuning. The redundant weights can be identified through various pruning strategies which have been discussed in detail as a part of this work. Specifically, we investigate the effect of the recently discovered Task-Agnostic and Task-Aware pruning on PADA and propose a new pruning paradigm based on the latter, which we call Cross-Domain Task-Aware Pruning (CD-TAW). CD-TAW obtains the initial pruning mask from a well fine-tuned OOD model, which makes it starkly different from the rest of the pruning strategies discussed in the paper. Our proposed CD-TAW methodology achieves up to 20.6% relative WER improvement over our baseline when fine-tuned on a 2-hour subset of Switchboard data without language model (LM) decoding. Furthermore, we conduct detailed analysis to highlight the key design choices of our proposed method.

“…Spontaneous speech analysis is a classic challenging task [9]. The conversational speech we are faced with presents specific difficulties, as it is often affected by dispersion, noise and incoherence.…”

Section: Linguistic Analysis (mentioning)

confidence: 99%

Assisting the Assistant: A Cobot for Voice Customer Support

Corrado, Giliberti, Gozzi et al.

2023

Frontiers in Artificial Intelligence and Applications

Despite recent advances in automation, customer support still requires a substantial amount of human intervention through voice channels. With the aim of improving the work of human assistants, we developed a collaborative bot (cobot) to help them in the process of handling customer voice interactions. The cobot is a reasoning agent that starts from loading background customer data into a dynamic knowledge graph. Then it captures the audio stream of the conversation, converts it to text in real time, analyzes the blocks of conversation with neural technologies and “thinks” about the results. Assistants can also supply data to the cobot, based on the information they gather from the ongoing conversation. The reasoning agent provides information and action suggestions to the human assistant by applying heuristics on data collected from both automatic and human sources, based on a task and domain-specific conceptual models (ontologies). While designing a prototypical solution for utility services in Italy, we are faced with many problems, including spontaneous speech understanding, factual and linguistic knowledge representation, and efficient heuristic reasoning. We adopted a standards-based approach and experimented with open source reasoners and publicly available language models. The paper presents preliminary findings and outlines the system design, with focus on the interplay of neural language processing and logic reasoning.

“…Auto-generated transcripts serve an integral part in providing equitable access of online video content to a wide variety of individuals and groups while voice based assistants enable users to avail a lot of online services with voice-based commands. In the past two decades, designing efficient ASRs have been an active area of research resulting in substantial advancement in the accuracy of these tools (Hannun 2021).…”

Section: Introduction (mentioning)

confidence: 99%

A Deep Dive into the Disparity of Word Error Rates across Thousands of NPTEL MOOC Videos

Rai, Jaiswal, Mukherjee

2024

ICWSM

Automatic speech recognition (ASR) systems are designed to transcribe spoken language into written text and find utility in a variety of applications including voice assistants and transcription services. However, it has been observed that state-of-the-art ASR systems which deliver impressive benchmark results, struggle with speakers of certain regions or demographics due to variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of ~9.8K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. The dataset is sourced from the very popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India. While there exists disparity due to gender, native region, age and speech rate of speakers, disparity based on caste is non-existent. We also observe statistically significant disparity across the disciplines of the lectures. These results indicate the need of more inclusive and robust ASR systems and more representational datasets for disparity evaluation in them.
