2020
DOI: 10.1109/access.2020.3019084
nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. Its speed allows on-the-fly spectrogram extraction, without the need to store any spectrograms on disk. Moreover, this approach also allows backpropagation through the waveform-to-spectrogram transformation layer, and hence the transformation process can be made…
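The core trick the abstract describes — computing an STFT with a 1D convolution whose kernels are windowed Fourier bases — can be sketched in NumPy. The frame-and-dot-product below is mathematically what a strided conv1d computes; the `n_fft` and `hop` values are illustrative, not taken from the paper:

```python
import numpy as np

def conv_stft(x, n_fft=64, hop=16):
    """Magnitude STFT computed as a bank of 1D convolution kernels.

    Each DFT bin corresponds to two kernels (a windowed cosine and a
    windowed sine); sliding them over the signal with stride `hop` is
    equivalent to a conv1d layer, which is the idea behind nnAudio.
    """
    n_bins = n_fft // 2 + 1
    window = np.hanning(n_fft)
    k = np.arange(n_fft)
    # Fourier-basis kernels, windowed: shape (n_bins, n_fft)
    cos_k = np.array([window * np.cos(2 * np.pi * f * k / n_fft) for f in range(n_bins)])
    sin_k = np.array([window * np.sin(2 * np.pi * f * k / n_fft) for f in range(n_bins)])
    n_frames = (len(x) - n_fft) // hop + 1
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    real = frames @ cos_k.T  # correlation == conv1d with stride=hop
    imag = frames @ sin_k.T
    return np.sqrt(real**2 + imag**2).T  # (n_bins, n_frames)

# Sanity check on a pure 100 Hz tone at an 800 Hz sampling rate:
# the peak should land in bin f0 * n_fft / sr = 100 * 64 / 800 = 8.
sr, f0 = 800, 100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t)
S = conv_stft(x, n_fft=64, hop=16)
```

Because every step is a convolution, the kernels can be registered as trainable weights in a deep-learning framework, which is what makes the transformation layer differentiable.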

Cited by 61 publications (29 citation statements)
References 36 publications
“…For each signal sampled at kHz, we used an STFT with a Hann window of points and a shifting interval of points ( ms) to calculate the amplitude spectrogram on a logarithmic frequency axis with bins per semitone (i.e. bin per cents) between Hz (C1) and Hz (C7) [30]. We then computed the HCQT-like spectrogram by stacking the -harmonic-shifted versions of the original spectrogram, where (i.e.…”
Section: Discussion
confidence: 99%
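The harmonic stacking this statement describes can be sketched as follows. On a log-frequency axis, multiplying frequency by a harmonic h corresponds to a shift of round(bins_per_octave · log2(h)) bins; the harmonic set and bins-per-octave value below are common choices assumed for illustration, since the excerpt's numeric values did not survive extraction:

```python
import numpy as np

def hcqt_stack(log_spec, harmonics=(0.5, 1, 2, 3), bins_per_octave=36):
    """Stack harmonic-shifted copies of a log-frequency spectrogram.

    Channel i at bin b reads the input at the bin of harmonic h_i of
    that bin's frequency; bins shifted out of range are zero-padded.
    Returns shape (n_harmonics, n_bins, n_frames).
    """
    n_bins, n_frames = log_spec.shape
    out = np.zeros((len(harmonics), n_bins, n_frames))
    for i, h in enumerate(harmonics):
        shift = int(round(bins_per_octave * np.log2(h)))
        if shift >= 0:
            out[i, :n_bins - shift] = log_spec[shift:]
        else:
            out[i, -shift:] = log_spec[:n_bins + shift]
    return out
```

With 36 bins per octave, a spike one octave up (bin 36) appears at bin 0 in the h = 2 channel, aligning harmonic energy across channels at the fundamental's position.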
“…where f_{m−1}, f_m, f_{m+1} are evenly spaced discrete frequencies on the Mel-scaled frequency axis, and the index m ∈ {1, 2, …, n_mel} indicates the filter's number in the filterbank. The log-Mel spectrogram is obtained by taking the logarithm of the STFT result, multiplied by the filterbank coefficients, at each timestep [35]:…”
Section: Visual Data Analysis
confidence: 99%
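The triangular Mel filterbank the quoted formula defines — each filter rising linearly from f_{m−1} to its center f_m and falling to f_{m+1}, with the centers evenly spaced on the Mel scale — can be sketched like this (the `sr`, `n_fft`, and `n_mel` defaults are illustrative, not values from the cited work):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mel=40):
    """Triangular filters with centers f_m evenly spaced on the Mel scale.

    Filter m rises linearly from bin(f_{m-1}) to bin(f_m) and falls to
    bin(f_{m+1}), matching the piecewise-linear definition above.
    Returns shape (n_mel, n_fft // 2 + 1).
    """
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mel + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mel, n_fft // 2 + 1))
    for m in range(1, n_mel + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

The log-Mel spectrogram is then `np.log(fb @ power_spectrogram + eps)`, i.e. the logarithm of the filterbank-weighted STFT magnitudes at each timestep.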
“…Even though CNNs were initially conceived for artificial vision, they have yielded good results on problems where visual representations of the input data are feasible. When working with audio, CNNs are frequently used in combination with mid-level time-frequency visual representations known as spectrograms [32]. The AudioSet VGGish model is an example of this.…”
Section: Fully Connected Neural Networks (FCNNs)
confidence: 99%
“…CNNs normally use visual representations of audio (spectrograms) as inputs [32], [62]. After the CNN is trained, researchers commonly apply a transfer-learning approach, extracting the embeddings of the network's final layers to produce pre-trained models [57], [63].…”
Section: State of the Art
confidence: 99%