Onsets and Frames: Dual-Objective Piano Transcription

Hawthorne, Curtis; Elsen, Erich; Song, Jialin; Roberts, Adam; Simon, Ian; Raffel, Colin; Engel, Jesse; Oore, Sageev; Eck, Douglas

doi:10.48550/arxiv.1710.11153

Cited by 11 publications (26 citation statements)

References 16 publications (33 reference statements)


“…Indeed, for piano performance videos from Internet (usually without accompanied ground truth Midi). We retrieve the Pseudo ground truth (GT) Midi from audio with the Onset and Frames framework [25]. This allows us to avoid hardware constraints of the instrument and to use any video, even those recorded in an unconstrained setup.…”

Section: Methods (mentioning)

confidence: 99%

“…Such Midi is typically obtained with an electronic keyboard, a process that make creation of the training data to be limited. To overcome this challenge, the Onsets and Frames framework enables to transcript audio waveform to Midi [25]. A recent work used this framework to obtain Pseudo Ground truth Midi and implemented a ResNet [26], to predict the pitch onsets events (times and identities of keys being pressed) given video frames stream [27].…”

Section: Related Work (mentioning)

confidence: 99%

“…Moreover, because M:,t is predicted at each frame individually, Roll predictions do not have temporal correlation. In addition, since pseudo GT Midi is generated from Onset and Frames framework [25] which depends on the audio stream, one common phenomenon that appears is: if the performer sustains a key for sufficiently long time, the magnitude of the corresponding frequency will gradually decay to zero and this key in pseudo GT Midi will be marked as off afterwards, however, since our Video2Roll Net depends on visual information only, all pressed keys are still considered as active but this prediction will not match the reality of the audio. Examples can be seen in Fig.…”

Section: Roll2midi Net (mentioning)

confidence: 99%


Audeo: Audio Generation for a Silent Performance Video

Su, Liu, Shlizerman. 2020. Preprint.

We present a novel system that gets as an input video frames of a musician playing the piano and generates the music for that video. Generation of music from visual cues is a challenging problem and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. To achieve the transformation we built a full pipeline named 'Audeo' containing three components. We first translate the video frames of the keyboard and the musician hand movements into raw mechanical musical symbolic representation Piano-Roll (Roll) for each video frame which represents the keys pressed at each time step. We then adapt the Roll to be amenable for audio synthesis by including temporal correlations. This step turns out to be critical for meaningful audio generation. As a last step, we implement Midi synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly with only a few setup constraints. We evaluate Audeo on 'in the wild' piano performance videos and obtain that their generated music is of reasonable audio quality and can be successfully recognized with high precision by popular music identification software.


“…As a result, there are a large number of transcription models whose success relies on hand-designed representations for piano transcription. For instance, the Onsets & Frames model (Hawthorne et al, 2017) uses dedicated outputs for detecting piano onsets and the note being played; Kelz et al (2019) represents the entire amplitude envelope of a piano note; and Kong et al (2020) additionally models piano foot pedal events (a piano-specific way of controlling a note's sustain). Single-instrument transcription models have also been developed for other instruments such as guitar (Xi et al, 2018) and drums (Cartwright & Bello, 2018;Callender et al, 2020), though these instruments have received less attention than piano.…”

Section: Music Transcription (mentioning)

confidence: 99%

MT3: Multi-Task Multitrack Music Transcription

Gardner, Simon, Manilow, et al. 2021. Preprint. Self-citation.

Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low-resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.


“…Onset can expresses the beginning of music notes and it is the most basic expression form of music rhythm [18,12,2]. Beat is another form of rhythm, and there is a lot of work on beat detection [4,25,26].…”

Section: Introduction (mentioning)

confidence: 99%

Music2Dance: DanceNet for Music-driven Dance Generation

Zhuang, Wang, Xia, et al. 2020. Preprint.

Synthesize human motions from music, i.e., music to dance, is appealing and attracts lots of research interests in recent years. It is challenging due to not only the requirement of realistic and complex human motions for dance, but more importantly, the synthesized motions should be consistent with the style, rhythm and melody of the music. In this paper, we propose a novel autoregressive generative model, DanceNet, to take the style, rhythm and melody of music as the control signals to generate 3D dance motions with high realism and diversity. To boost the performance of our proposed model, we capture several synchronized music-dance pairs by professional dancers, and build a high-quality music-dance pair dataset. Experiments have demonstrated that the proposed method can achieve the state-of-the-art results.
