2020
DOI: 10.1109/access.2020.3019084
nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. Its speed allows on-the-fly spectrogram extraction, without the need to store any spectrograms on disk. Moreover, this approach also allows backpropagation through the waveform-to-spectrogram transformation layer, and hence the transformation process can be made…
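The core trick the abstract describes — computing an STFT with a 1D convolution whose kernels are windowed Fourier bases — can be sketched in NumPy. The frame-and-dot-product below is mathematically what a strided conv1d computes; the `n_fft` and `hop` values are illustrative, not taken from the paper:

```python
import numpy as np

def conv_stft(x, n_fft=64, hop=16):
    """Magnitude STFT computed as a bank of 1D convolution kernels.

    Each DFT bin corresponds to two kernels (a windowed cosine and a
    windowed sine); sliding them over the signal with stride `hop` is
    equivalent to a conv1d layer, which is the idea behind nnAudio.
    """
    n_bins = n_fft // 2 + 1
    window = np.hanning(n_fft)
    k = np.arange(n_fft)
    # Fourier-basis kernels, windowed: shape (n_bins, n_fft)
    cos_k = np.array([window * np.cos(2 * np.pi * f * k / n_fft) for f in range(n_bins)])
    sin_k = np.array([window * np.sin(2 * np.pi * f * k / n_fft) for f in range(n_bins)])
    n_frames = (len(x) - n_fft) // hop + 1
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    real = frames @ cos_k.T  # correlation == conv1d with stride=hop
    imag = frames @ sin_k.T
    return np.sqrt(real**2 + imag**2).T  # (n_bins, n_frames)

# Sanity check on a pure 100 Hz tone at an 800 Hz sampling rate:
# the peak should land in bin f0 * n_fft / sr = 100 * 64 / 800 = 8.
sr, f0 = 800, 100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t)
S = conv_stft(x, n_fft=64, hop=16)
```

Because every step is a convolution, the kernels can be registered as trainable weights in a deep-learning framework, which is what makes the transformation layer differentiable.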

Cited by 61 publications (29 citation statements)
References 36 publications
“…For each signal sampled at kHz, we used an STFT with a Hann window of points and a shifting interval of points ( ms) to calculate the amplitude spectrogram on a logarithmic frequency axis with bins per semitone (i.e. bin per cents) between Hz (C1) and Hz (C7) [30]. We then computed the HCQT-like spectrogram by stacking the -harmonic-shifted versions of the original spectrogram, where (i.e.…”
Section: Discussion
confidence: 99%
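The harmonic stacking this statement describes can be sketched as follows. On a log-frequency axis, multiplying frequency by a harmonic h corresponds to a shift of round(bins_per_octave · log2(h)) bins; the harmonic set and bins-per-octave value below are common choices assumed for illustration, since the excerpt's numeric values did not survive extraction:

```python
import numpy as np

def hcqt_stack(log_spec, harmonics=(0.5, 1, 2, 3), bins_per_octave=36):
    """Stack harmonic-shifted copies of a log-frequency spectrogram.

    Channel i at bin b reads the input at the bin of harmonic h_i of
    that bin's frequency; bins shifted out of range are zero-padded.
    Returns shape (n_harmonics, n_bins, n_frames).
    """
    n_bins, n_frames = log_spec.shape
    out = np.zeros((len(harmonics), n_bins, n_frames))
    for i, h in enumerate(harmonics):
        shift = int(round(bins_per_octave * np.log2(h)))
        if shift >= 0:
            out[i, :n_bins - shift] = log_spec[shift:]
        else:
            out[i, -shift:] = log_spec[:n_bins + shift]
    return out
```

With 36 bins per octave, a spike one octave up (bin 36) appears at bin 0 in the h = 2 channel, aligning harmonic energy across channels at the fundamental's position.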
“…where f_{m−1}, f_m, f_{m+1} are evenly spaced discrete frequencies on the Mel-scaled frequency axis, and the index m ∈ {1, 2, …, n_mel} indicates the filter's number in the filterbank. The log-Mel spectrogram is obtained by taking the logarithm of the STFT result, multiplied by the filterbank coefficients, at each timestep [35]:…”
Section: Visual Data Analysis
confidence: 99%
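The triangular Mel filterbank the quoted formula defines — each filter rising linearly from f_{m−1} to its center f_m and falling to f_{m+1}, with the centers evenly spaced on the Mel scale — can be sketched like this (the `sr`, `n_fft`, and `n_mel` defaults are illustrative, not values from the cited work):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mel=40):
    """Triangular filters with centers f_m evenly spaced on the Mel scale.

    Filter m rises linearly from bin(f_{m-1}) to bin(f_m) and falls to
    bin(f_{m+1}), matching the piecewise-linear definition above.
    Returns shape (n_mel, n_fft // 2 + 1).
    """
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mel + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mel, n_fft // 2 + 1))
    for m in range(1, n_mel + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

The log-Mel spectrogram is then `np.log(fb @ power_spectrogram + eps)`, i.e. the logarithm of the filterbank-weighted STFT magnitudes at each timestep.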
“…Even though CNNs were initially conceived for artificial vision, they have yielded good results on problems where visual representations of the input data are feasible. When working with audio, CNNs are frequently used in combination with mid-level time-frequency visual representations known as spectrograms [32]. The AudioSet VGGish model is an example of this.…”
Section: Fully Connected Neural Networks (FCNNs)
confidence: 99%
“…CNNs normally use visual representations of audio (spectrograms) as inputs [32], [62]. After the CNN is trained, researchers commonly apply a transfer-learning approach, extracting the embeddings of the network's final layers to produce pre-trained models [57], [63].…”
Section: State of the Art
confidence: 99%