2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa.2019.8937135
Joint Singing Pitch Estimation and Voice Separation Based on a Neural Harmonic Structure Renderer

Cited by 7 publications (8 citation statements) · References 9 publications
“…A voice separation method and a beat-tracking method are used in the preprocessing step of the present method, and we observed that errors made in preprocessing can propagate to the transcription results. To mitigate this problem, multi-task learning of singing voice separation and AST can also be effective in obtaining singing voices appropriate for the AST [5]. A beat-tracking method typically estimates beat times in the accompaniment sounds, which can be slightly shifted from the onset times of the singing voice due to the asynchrony between the vocal and the other parts [36].…”
Section: E) Discussion (mentioning)
confidence: 99%
“…Inspired by the CNN proposed for frame-level melody F0 estimation [3], the frame-level CNN of the acoustic model (Fig. 5) was designed to have six convolution layers with output channels of 128, 64, 64, 64, 8, and 1 and kernel sizes of (5, 5), (5, 5), (3, 3), (3, 3), (70, 3), and (1, 1), respectively, where instance normalization [31] and the Mish function [32] are used. The output dimension of the tatum-level BLSTM was set to D = 130 × 2.…”
Section: B) Setup (mentioning)
confidence: 99%
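The excerpt above fully specifies the channel counts, kernel sizes, normalization, and activation of the frame-level CNN, so the architecture can be sketched directly. Below is a minimal PyTorch sketch of that six-layer stack; the single input channel, the "same" padding, and the example input shape are assumptions not stated in the excerpt.

```python
# Minimal sketch (assumed PyTorch) of the six-layer frame-level CNN described
# above: output channels 128/64/64/64/8/1, kernel sizes (5,5)/(5,5)/(3,3)/
# (3,3)/(70,3)/(1,1), with instance normalization and the Mish activation.
import torch
import torch.nn as nn

class FrameLevelCNN(nn.Module):
    def __init__(self, in_channels: int = 1):  # single input channel: assumption
        super().__init__()
        # (out_channels, kernel_size) pairs as listed in the citing paper.
        specs = [(128, (5, 5)), (64, (5, 5)), (64, (3, 3)),
                 (64, (3, 3)), (8, (70, 3)), (1, (1, 1))]
        layers = []
        c_in = in_channels
        for c_out, k in specs:
            # padding="same" is an assumption; the excerpt does not specify it.
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=k, padding="same"),
                nn.InstanceNorm2d(c_out),
                nn.Mish(),
            ]
            c_in = c_out
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        return self.net(x)

if __name__ == "__main__":
    # Illustrative shapes only: 513 frequency bins, 100 frames.
    model = FrameLevelCNN()
    out = model(torch.randn(1, 1, 513, 100))
    print(out.shape)  # torch.Size([1, 1, 513, 100])
```

The (70, 3) kernel in the fifth layer spans a wide frequency range per frame, which is consistent with aggregating harmonic evidence across bins before the final (1, 1) projection.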
“…Most recently, Nakano et al. [22] and Jansson et al. [23] proposed, almost simultaneously, to train the SVS task and the VME task jointly. Both methods obtained promising results.…”
Section: Source Separation-based Vocal Melody Extraction (mentioning)
confidence: 99%
“…According to the performance of Deep Salience reported in [22], the F0 values estimated by Deep Salience still contain errors, which limits the performance of this method to a certain extent. In [23], the authors designed a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of the harmonic partials of a singing voice. However, this system is not robust to backing vocals: in the SVS task the backing vocals belong to the vocals, but in the VME task the pitches of backing vocals do not belong to the vocal melody.…”
Section: Source Separation-based Vocal Melody Extraction (mentioning)
confidence: 99%
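The excerpt describes the core idea of the cited layer (an F0 saliency spectrogram rendered into masks over the harmonic partials) but not its implementation. Below is a minimal, hypothetical NumPy sketch of that idea only: pick a per-frame F0 from the saliency map and place soft Gaussian masks at its integer multiples. The function name, the argmax F0 picking, n_harmonics, and width_hz are all illustrative assumptions; the actual layer in [23] is differentiable and trained end to end.

```python
# Hypothetical sketch of rendering an F0 saliency spectrogram into a soft
# harmonic mask. Not the method of [23]; a hard per-frame argmax replaces
# the differentiable rendering for illustration.
import numpy as np

def harmonic_mask(saliency: np.ndarray, freqs: np.ndarray,
                  n_harmonics: int = 8, width_hz: float = 25.0) -> np.ndarray:
    """saliency: (n_bins, n_frames) F0 saliency map; freqs: (n_bins,) bin
    centre frequencies in Hz. Returns a (n_bins, n_frames) soft mask."""
    # Per-frame F0 track from the saliency peak (assumption: argmax picking).
    f0 = freqs[np.argmax(saliency, axis=0)]
    mask = np.zeros_like(saliency)
    for h in range(1, n_harmonics + 1):
        # Gaussian bump centred on the h-th harmonic of each frame's F0.
        mask += np.exp(-((freqs[:, None] - h * f0[None, :]) ** 2)
                       / (2.0 * width_hz ** 2))
    return np.clip(mask, 0.0, 1.0)

# Usage: mask a mixture magnitude spectrogram to keep the vocal partials.
# vocal_mag = harmonic_mask(saliency, freqs) * mixture_mag
```

The backing-vocal weakness noted in the excerpt is visible in this sketch: any saliency peak, whether lead or backing vocal, produces a harmonic mask, so partials of backing vocals are passed to separation even though their pitches are not part of the vocal melody.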