2020
DOI: 10.48550/arxiv.2010.01815
Preprint

High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

Abstract: Automatic music transcription (AMT) is the task of transcribing audio recordings into symbolic representations such as the Musical Instrument Digital Interface (MIDI). Recently, neural network-based methods have been applied to AMT and have achieved state-of-the-art results. However, most previous AMT systems predict only the presence or absence of notes in the frames of audio recordings, so their transcription resolution is limited to the hop-size time between adjacent frames. In addition, previous AM…
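The resolution limit described in the abstract can be made concrete with a small sketch. This is illustrative code, not the authors' implementation: the sample rate and hop size are assumed values, and the regression target shown is a simplified stand-in for the paper's continuous onset/offset regression.

```python
# Illustrative sketch (assumed parameters, not the authors' code):
# frame-wise prediction quantizes an onset to the nearest frame boundary,
# so its error is bounded by half the hop-size time; regressing a
# continuous within-frame residual removes that bound.

SAMPLE_RATE = 16000   # Hz, assumed for illustration
HOP_SIZE = 512        # samples between adjacent frames, assumed

def frame_quantized_onset(onset_sec: float) -> float:
    """Round an onset to the nearest frame boundary (frame-wise prediction)."""
    frame = round(onset_sec * SAMPLE_RATE / HOP_SIZE)
    return frame * HOP_SIZE / SAMPLE_RATE

def regression_target(onset_sec: float) -> tuple[int, float]:
    """Return the containing frame index plus the continuous within-frame
    residual (seconds) that a regression head would be trained to predict."""
    frame_len = HOP_SIZE / SAMPLE_RATE
    frame = int(onset_sec / frame_len)
    residual = onset_sec - frame * frame_len
    return frame, residual

onset = 1.2345
# Frame-wise error is at most half a frame (here 512 / 16000 / 2 = 16 ms).
print(abs(frame_quantized_onset(onset) - onset))
# With the residual, the exact time is recoverable: frame * frame_len + residual.
frame, res = regression_target(onset)
print(frame, res)
```

The design point is simply that the classification head discards the sub-frame timing information, while the regression head keeps it as a second, continuous output.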

Cited by 13 publications (12 citation statements). References 29 publications.
“…5 ms in Handel [34]). We leave open the possibility that our results could be improved further with finer event resolution, for example by predicting continuous times as in Kong et al [19].…”
Section: Inputs and Outputs
confidence: 92%
“…Kong et al [19] achieve higher transcription accuracy by using regression to predict precise continuous onset/offset times, using a similar network architecture to Hawthorne et al [3]. Kim & Bello [20] use an adversarial loss on the transcription output to encourage a transcription model to output more plausible piano rolls.…”
Section: Related Work 2.1 Piano Transcription
confidence: 99%
“…We use the GiantMIDI-Piano dataset [17], which includes 10,854 piano performances by 2,786 composers, transcribed from live recordings using [18] and encoded in the MIDI format. We use a 90/10/0 (train/validation/test) split, assigning every tenth file to the validation set.…”
Section: Dataset
confidence: 99%
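The split described in the excerpt above ("one file every ten" to validation) can be sketched in a few lines. This is a hypothetical reconstruction, not the citing authors' code; the file names are placeholders.

```python
# Hypothetical sketch of the 90/10/0 split described above: every tenth
# file is assigned to the validation set, the rest to training, and the
# test set is left empty. Which offset counts as "every tenth" is an
# assumption; here we take indices 9, 19, 29, ...

def split_every_tenth(files: list[str]) -> tuple[list[str], list[str]]:
    train, val = [], []
    for i, f in enumerate(files):
        (val if i % 10 == 9 else train).append(f)
    return train, val

files = [f"perf_{i:05d}.mid" for i in range(100)]  # placeholder names
train, val = split_every_tenth(files)
print(len(train), len(val))  # 90 10
```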
“…As a result, a large number of transcription models rely on hand-designed representations for piano transcription. For instance, the Onsets & Frames model (Hawthorne et al, 2017) uses dedicated outputs for detecting piano onsets and the note being played; Kelz et al (2019) represent the entire amplitude envelope of a piano note; and Kong et al (2020) additionally model piano foot pedal events (a piano-specific way of controlling a note's sustain). Single-instrument transcription models have also been developed for other instruments such as guitar (Xi et al, 2018) and drums (Cartwright & Bello, 2018; Callender et al, 2020), though these instruments have received less attention than piano.…”
Section: Music Transcription
confidence: 99%