ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747236
Torchaudio: Building Blocks for Audio and Speech Processing

Cited by 48 publications (10 citation statements); references 11 publications.
“…In the last set of experiments, we combined the decoder frame reduction with lattice reduction. Note that, unlike the RNN-T loss implementation in [30], which calculates the costs of all the edges in the RNN-T lattice in one pass, our baseline implementation already splits the lattice along the time axis into several strips (strip length 8 in our experiments, i.e., a confidence-region width of 8 in Figure 3) and calculates the edge costs in each strip sequentially. This implementation avoids allocating huge amounts of TPU memory, but it is relatively slow because the sequential computation is not TPU-friendly.…”
Section: Results; confidence: 99%
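The strip-wise computation the quote describes can be sketched with the standard RNN-T forward (alpha) recursion processed in time-axis strips. This is a minimal NumPy sketch for illustration, not the citing paper's actual TPU implementation; the function name, the cost-array layout, and the strip width are assumptions:

```python
import numpy as np

def rnnt_alpha_strips(log_blank, log_label, strip=8):
    """Forward (alpha) recursion over an RNN-T lattice, walked in time-axis
    strips so that only `strip` frames of edge costs are touched at a time.

    log_blank[t, u]: log-prob of the blank edge leaving node (t, u)
    log_label[t, u]: log-prob of the label edge leaving node (t, u)
    """
    T, U = log_blank.shape
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = 0.0
    # Strips are processed sequentially: this is the memory/speed trade-off
    # the quoted passage points out (less memory, but serialized compute).
    for t0 in range(0, T, strip):
        for t in range(t0, min(t0 + strip, T)):
            for u in range(U):
                if t == 0 and u == 0:
                    continue
                from_blank = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
                from_label = alpha[t, u - 1] + log_label[t, u - 1] if u > 0 else -np.inf
                alpha[t, u] = np.logaddexp(from_blank, from_label)
    return alpha
```

Because the recursion itself is unchanged, the strip width only affects memory locality, not the result: running with `strip=T` reproduces the one-pass computation exactly.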
“…Compared with our baseline implementation, we mainly save computation. On the other hand, our proposed lattice reduction method can equally be applied to other implementations, such as the one in [30], and could save both computation and TPU/GPU memory considerably. Results are presented in Table 3.…”
Section: Results; confidence: 99%
“…The train-clean-360, dev-clean, and test-clean subsets of the corpus contain 921, 40, and 40 speakers, respectively. We used the voice activity detection (VAD) provided by the Torchaudio module [18] to remove silent segments. Speech segments shorter than 2 seconds after VAD were discarded.…”
Section: Methods; confidence: 99%
“…These recordings were sampled at 8 kHz. We upsample the audio to 16 kHz, as required for inputs to the various models investigated in this paper and as is commonly done [7], [44], [55]. We use the Mississippi State University (MSU) transcripts, which include transcript corrections and more accurate word alignments, which are important for frame-level detection [12].…”
Section: Dataset; confidence: 99%