2020 28th European Signal Processing Conference (EUSIPCO) 2021
DOI: 10.23919/eusipco47968.2020.9287625
Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment

Cited by 9 publications (7 citation statements) | References 17 publications
“…However, their method struggles with pieces containing both backward and forward jumps, an important challenge we tackle using our progressively dilated convolutional models. Recent work on audio-to-score alignment has demonstrated the efficacy of multimodal embeddings [15], reinforcement learning [16], [17], and learnt frame similarities [18], although these are not structure-aware methods. Very recently, Shan et al. proposed Hierarchical-DTW [19] to automatically generate piano score-following videos given an audio recording and a raw image of sheet music.…”
Section: Related Work
confidence: 99%
“…IRs allow us to model different recording conditions in the form of microphones and room characteristics. We use more than 500 freely available IRs collected from OpenAIRLib and MicIRP. The IR signal is convolved with the audio on-the-fly during training, such that in each epoch the model encounters different audio scenarios, which should allow for a more robust audio encoding model.…”
Section: A. Impulse Response Data Augmentation
confidence: 99%
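The augmentation described above — convolving each training clip with a randomly chosen impulse response on the fly — can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name is hypothetical, and synthetic exponentially decaying IRs stand in for the real OpenAIRLib/MicIRP recordings.

```python
import numpy as np

def apply_impulse_response(audio: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Convolve a dry signal with a room/microphone impulse response.

    Both inputs are 1-D float arrays at the same sample rate. The wet
    signal is trimmed to the original length and peak-normalised so the
    augmented example stays in a comparable amplitude range.
    """
    wet = np.convolve(audio, ir, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# On-the-fly use inside a training loop: draw a random IR per example,
# so each epoch sees the same clip under a different acoustic scenario.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)              # stand-in: 1 s clip at 16 kHz
ir_bank = [rng.standard_normal(2048) * np.exp(-np.linspace(0, 8, 2048))
           for _ in range(4)]                   # toy decaying impulse responses
augmented = apply_impulse_response(audio, ir_bank[rng.integers(len(ir_bank))])
```

In practice the convolution would be done with an FFT-based routine (e.g. `scipy.signal.fftconvolve`) for long IRs, since direct convolution scales poorly with IR length.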
“…This is a consequence of the ability of DNNs to learn complex nonlinear mappings through which musical objects can be expected to be better separated [11]. Hence, while DNNs generally learn "deep" features stemming from many examples in a training phase and then evaluate the potential of the learned features [12], our SSAE approach focuses on the different events within a single song and tries to learn nonlinear latent representations, which are then used to infer the structure.…”
Section: Introduction
confidence: 99%
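The single-song idea above — fitting an autoencoder to the frames of one piece so that its bottleneck yields nonlinear per-frame features — can be sketched minimally. This is an illustrative toy, not the cited SSAE: random data stands in for audio frames, and the dimensions and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
frames = rng.standard_normal((200, 64))   # stand-in: 200 frames x 64 features

d_in, d_lat = frames.shape[1], 8
W1 = rng.standard_normal((d_in, d_lat)) * 0.1   # encoder weights
b1 = np.zeros(d_lat)
W2 = rng.standard_normal((d_lat, d_in)) * 0.1   # decoder weights
b2 = np.zeros(d_in)
lr = 1e-2

for _ in range(500):
    h = np.tanh(frames @ W1 + b1)         # nonlinear latent code per frame
    x_hat = h @ W2 + b2                   # linear reconstruction
    err = x_hat - frames                  # reconstruction residual
    # Gradient descent on mean squared reconstruction error.
    gW2 = h.T @ err / len(frames)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)      # backprop through tanh
    gW1 = frames.T @ dh / len(frames)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Bottleneck activations serve as per-frame embeddings for structure analysis.
latent = np.tanh(frames @ W1 + b1)
```

Since the model is trained on a single piece, the latent codes separate the recurring events of that piece specifically, rather than generic features pooled over a corpus.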