ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747236
Torchaudio: Building Blocks for Audio and Speech Processing

Cited by 48 publications (10 citation statements); references 11 publications.
“…In the last set of experiments, we combined the decoder frame reduction with lattice reduction. Note that, unlike the RNN-T loss implementation in [30], which calculates the costs of all the edges in the RNN-T lattice in one pass, our baseline implementation already splits the lattice along the time axis into several strips (strip length 8 in our experiments, i.e., a confidence-region width of 8 in Figure 3) and calculates the edge costs in each strip sequentially. This implementation avoids allocating huge amounts of TPU memory, but it is relatively slow because the sequential computation is not TPU-friendly.…”
Section: Results; confidence: 99%
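The strip-wise computation the quote describes can be sketched with the standard RNN-T forward (alpha) recursion processed in time-axis strips. This is a minimal NumPy sketch for illustration, not the citing paper's actual TPU implementation; the function name, the cost-array layout, and the strip width are assumptions:

```python
import numpy as np

def rnnt_alpha_strips(log_blank, log_label, strip=8):
    """Forward (alpha) recursion over an RNN-T lattice, walked in time-axis
    strips so that only `strip` frames of edge costs are touched at a time.

    log_blank[t, u]: log-prob of the blank edge leaving node (t, u)
    log_label[t, u]: log-prob of the label edge leaving node (t, u)
    """
    T, U = log_blank.shape
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = 0.0
    # Strips are processed sequentially: this is the memory/speed trade-off
    # the quoted passage points out (less memory, but serialized compute).
    for t0 in range(0, T, strip):
        for t in range(t0, min(t0 + strip, T)):
            for u in range(U):
                if t == 0 and u == 0:
                    continue
                from_blank = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
                from_label = alpha[t, u - 1] + log_label[t, u - 1] if u > 0 else -np.inf
                alpha[t, u] = np.logaddexp(from_blank, from_label)
    return alpha
```

Because the recursion itself is unchanged, the strip width only affects memory locality, not the result: running with `strip=T` reproduces the one-pass computation exactly.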
“…Compared with our baseline implementation, we mainly save computation. On the other hand, our proposed lattice reduction method can equally be applied to other implementations, such as the one in [30], and could save both computation and TPU/GPU memory considerably. Results are presented in Table 3.…”
Section: Results; confidence: 99%
“…The train-clean-360, dev-clean, and test-clean subsets of the corpus contain 921, 40, and 40 speakers, respectively. We used the voice activity detection (VAD) provided by the Torchaudio module [18] to remove silent segments. Speech segments shorter than 2 seconds after VAD were discarded.…”
Section: Methods; confidence: 99%
“…These recordings were sampled at 8 kHz. We upsample the audio to 16 kHz, as required for inputs to the various models investigated in this paper and as is commonly done [7], [44], [55]. We use the Mississippi State University (MSU) transcripts, which include transcript corrections and more accurate word alignments, which are important for frame-level detection [12].…”
Section: Dataset; confidence: 99%