TorchAudio: Building Blocks for Audio and Speech Processing

Yang, Yao-Yuan; Hira, Moto; Ni, Zhaoheng; Chourdia, Anjali; Astafurov, Artyom; Chen, Caroline; Yeh, Ching-Feng; Puhrsch, Christian; Pollack, David; Genzel, Dmitriy; Greenberg, Donny; Yang, Emily; Lian, Jason; Mahadeokar, Jay; Hwang, Jeff Yi-Fu; Chen, Ji; Goldsborough, Peter; Roy, Prabhat; Narenthiran, Sean; Watanabe, Soichi; Chintala, Soumith; Quenneville-Bélair, Vincent; Shi, Yangyang

doi:10.48550/arxiv.2110.15018

Cited by 7 publications

(6 citation statements)

References 0 publications


“…LM Beam-Search Decoding In all our experimental results, we report WER and CER, both with greedy and LM-beam search decoding. We rely on the lexicon-based beam-search decoder (with a word-based LM) from the flashlight framework [21], ported in torchaudio [40]. The same beam-search decoder is used to generate PLs in cross-lingual PL 5 .…”

Section: Monolingual Language Models (mentioning)

confidence: 99%

slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Likhomanenko

Kahn

et al. 2021

Interspeech 2021


Recent results in end-to-end ASR have demonstrated the efficacy of simple pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further increase performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard-label assignments (the most probable tokens), that is, without a language model. We call this approach Language-Model-Free IPL (slimIPL) and we give a resultant training setup for CTC and seq2seq models. At inference, our experiments show that decoding with a strong language model is more beneficial with slimIPL than IPL, as IPL exhibits some language model over-fitting issues. Compared to prior work on semi-supervised and unsupervised approaches, slimIPL not only simplifies the training process, but also achieves competitive and state-of-the-art results on LIBRISPEECH test sets in both standard and low-resource settings.
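The citing passage reports WER and CER for each decoding mode. As a rough illustration of how these two metrics are commonly computed (a minimal sketch, not the evaluation code used by the cited authors), both can be expressed as a Levenshtein edit distance divided by the reference length, over words for WER and over characters for CER. The function names `edit_distance`, `wer`, and `cer` are hypothetical and chosen only for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `hyp` into `ref`.
    """
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance against an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For instance, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Published results usually apply text normalization before scoring, which this sketch leaves out.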


“…First, feature extraction is performed on the raw audio. For MFCC calculation, we use the implementation by torchaudio [16] with the default parameters and a sample rate of 16 kHz. The XLS-R feature extraction is based on the facebook/wav2vec2-xls-r-300m model available at the HuggingFace [17] model hub.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Bastiaan

Balabin

Vandenberghe

et al. 2022

Interspeech 2022


Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds task-specific DNNs by several orders of magnitude, which poses a challenge for resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
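The abstract above measures MOS prediction quality by RMSE. For reference, a minimal sketch of that metric is shown here; the function name `rmse` and the example scores are hypothetical and are not taken from the cited paper.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between predicted and target scores."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical MOS values on the usual 1-5 scale, for illustration only.
predicted_mos = [3.0, 4.0]
human_mos = [3.0, 2.0]
error = rmse(predicted_mos, human_mos)
```

A lower value means the predicted scores sit closer to the human ratings, and because the errors are squared before averaging, large misses are penalized more heavily than small ones.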


“…Audio file pre-processing operations have been conducted with the Python libraries NumPy [64] for operations on arrays and LibRosa [65] for audio file loading, resampling, normalizing and writing. Feature extraction procedures have also been performed with LibRosa and Torchaudio [66] libraries.…”

Section: A3siren-recordings (mentioning)

confidence: 99%

Few-Shot Emergency Siren Detection

Cantarini

Gabrielli

Squartini

2022

Sensors


It is a well-established practice to build a robust system for sound event detection by training supervised deep learning models on large datasets, but audio data collection and labeling are often challenging and require large amounts of effort. This paper proposes a workflow based on few-shot metric learning for emergency siren detection performed in steps: prototypical networks are trained on publicly available sources or synthetic data in multiple combinations, and at inference time, the best knowledge learned in associating a sound with its class representation is transferred to identify ambulance sirens, given only a few instances for the prototype computation. Performance is evaluated on siren recordings acquired by sensors inside and outside the cabin of an equipped car, investigating the contribution of filtering techniques for background noise reduction. The results show the effectiveness of the proposed approach, achieving AUPRC scores equal to 0.86 and 0.91 in unfiltered and filtered conditions, respectively, outperforming a convolutional baseline model with and without fine-tuning for domain adaptation. Extensive experiments conducted on several recording sensor placements prove that few-shot learning is a reliable technique even in real-world scenarios and gives valuable insights for developing an in-car emergency vehicle detection system.
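The abstract above describes computing class prototypes from a few labeled instances and using them to identify sirens at inference time. A minimal sketch of that prototype step follows, assuming the embeddings have already been produced by some trained network and that Euclidean distance is used for matching. The names `mean_vector`, `build_prototypes`, and `classify`, and the toy two-dimensional embeddings, are hypothetical and do not reproduce the cited authors' implementation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy support set: two embeddings per class (assumed, for illustration only).
support = {
    "siren": [[1.0, 1.0], [3.0, 1.0]],
    "noise": [[-1.0, -1.0], [-3.0, -1.0]],
}
prototypes = build_prototypes(support)
label = classify([1.5, 0.5], prototypes)
```

In this setup only the support embeddings change when a new class is introduced, so the embedding network itself does not need retraining to recognize it.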
