Qiantong Xu scite author profile

Libri-Light: A Benchmark for ASR with Limited or No Supervision

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Index Terms: unsupervised and semi-supervised learning, distant supervision, dataset, zero- and low-resource ASR.
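The CER and WER metrics named in the abstract above are both edit distances normalized by the reference length, computed over characters and words respectively. The following is a minimal sketch of that definition, not the benchmark's own scoring code; the function names are illustrative, and it assumes whitespace tokenization for words and a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate over raw characters, spaces included."""
    return error_rate(list(reference), list(hypothesis))
```

Because insertions count as errors, either rate can exceed 1.0 when the hypothesis is much longer than the reference.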


data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Baevski, Hsu, Xu et al. 2022. Preprint.



While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.


XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Babu, Wang, Tjandra et al. 2021. Preprint.


This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can perform as well as English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/wav2vec/xlsr.


Using the Dimensions of Mastery Questionnaire (DMQ) to Assess Mastery Motivation of English- and Chinese-Speaking Children

Morgan, Wang, Liao et al.


MLS: A Large-Scale Multilingual Dataset for Speech Research

Pratap, Xu, Sriram et al. 2020




