2021
DOI: 10.1017/atsip.2021.4

Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Abstract: This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model represen…
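To make the naive cascading baseline described in the abstract concrete, the sketch below quantizes an estimated F0 contour to the nearest semitone and assigns one pitch per tatum interval by majority vote. It is a minimal illustration of why such a cascade accumulates pitch and rhythm errors, not the paper's CRNN-HSMM method; the function names, the voting rule, and the handling of unvoiced frames are all assumptions.

```python
import numpy as np

def naive_cascade_transcription(f0_hz, times, tatum_times):
    """Naive cascading AST baseline (illustrative only):
    1) quantize each F0 frame to the nearest MIDI semitone,
    2) assign one pitch per tatum interval by majority vote.
    Vibrato, portamento, and timing deviations are quantized
    independently, which is where pitch/rhythm errors creep in."""
    # Frame-wise semitone quantization (0 Hz frames treated as unvoiced).
    midi = np.full_like(f0_hz, -1, dtype=int)
    voiced = f0_hz > 0
    midi[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(int)

    notes = []  # (onset_time, offset_time, midi_pitch)
    for start, end in zip(tatum_times[:-1], tatum_times[1:]):
        in_beat = (times >= start) & (times < end)
        pitches = midi[in_beat]
        pitches = pitches[pitches >= 0]
        if len(pitches) == 0:
            continue  # no voiced frames in this tatum: treat as a rest
        # Majority vote over the tatum interval.
        values, counts = np.unique(pitches, return_counts=True)
        notes.append((start, end, int(values[np.argmax(counts)])))
    return notes
```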

Cited by 10 publications (2 citation statements)
References 24 publications
“…In [36], the author proposed a Bayesian hierarchical hidden semi-Markov model (HHSMM), which generates a note sequence and consists of three sub-models describing local keys, pitches, and onset score times. Later, a CRNN-HSMM hybrid model was proposed in [37], which estimates the most likely notes from the music signal using the Viterbi algorithm. This method improved the performance of AST and was superior to the HSMM-based method of [36], the most advanced method at that time.…”
Section: Singing Transcription
confidence: 99%
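As a rough illustration of the Viterbi decoding step mentioned in this statement, the generic sketch below finds the most likely sequence of pitch states from frame-wise log-posteriors and a transition matrix. It is not the CRNN-HSMM model of [37]; in particular, the explicit note-duration modeling of an HSMM is omitted, and the array shapes and state definitions are assumptions.

```python
import numpy as np

def viterbi_pitch_path(log_post, log_trans, log_init):
    """Generic Viterbi decoding (illustrative sketch, not the model of [37]).
    log_post : (T, K) frame-wise log-probabilities of K pitch states
    log_trans: (K, K) log transition matrix between pitch states
    log_init : (K,)  log initial state distribution
    Returns the most likely state index per frame (length T)."""
    T, K = log_post.shape
    delta = np.empty((T, K))
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = log_init + log_post[0]
    for t in range(1, T):
        # scores[i, j]: best score of being in state i at t-1 and moving to j.
        scores = delta[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_post[t]
    # Backtrack the most likely path.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```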
“…Multiple methods have been proposed for estimating notes from pitch posteriorgrams, e.g. using median filtering [11], hidden Markov models [16], or neural networks [20,21]. While most approaches consider each semitone independently, some approaches attempt to model the interactions between notes, using spectral likelihood models [1,18] or music language models [3,17].…”
Section: Background and Related Work
confidence: 99%
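Of the baseline techniques named in this statement, median filtering of a pitch posteriorgram [11] is perhaps the simplest; the sketch below is a hedged illustration of that idea, with the window length and voicing threshold chosen arbitrarily rather than taken from the cited work.

```python
import numpy as np
from scipy.ndimage import median_filter

def posteriorgram_to_pitch_track(posteriorgram, threshold=0.5, window=9):
    """Smooth a (T, K) pitch posteriorgram along time with a median filter,
    then take the per-frame argmax if it exceeds a voicing threshold.
    Window length and threshold are illustrative, not values from [11]."""
    # Median-filter each pitch bin along the time axis to suppress spurious frames.
    smoothed = median_filter(posteriorgram, size=(window, 1))
    best = np.argmax(smoothed, axis=1)
    conf = smoothed[np.arange(len(best)), best]
    # -1 marks frames judged unvoiced.
    return np.where(conf >= threshold, best, -1)
```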