Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation

Gao, Yongwei; Zhang, Xulong; Li, Wei

doi:10.3390/electronics10030298

Cited by 25 publications

(8 citation statements)

References 18 publications

(30 reference statements)


“…Recently, a frequency-temporal attention module was introduced in [19] to learn the relevant regions for predictions. Some special representations are proposed including HCQT [7], a combination of frequency and periodicity [20], and source-separated tracks [21,22].…”

Section: Related Work (mentioning)

confidence: 99%

SpecTNT: a Time-Frequency Transformer for Music Audio

Lu, Wang, Won et al. 2021

Preprint


Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use Transformer as a temporal feature aggregator that acts similar to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. Later, the FCTs are linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Then, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and other design choices are further examined through ablation studies.



“…SID is used in music library management to address the classification of songs by singers. Furthermore, the SID model is able to be used for downstream singing-related applications, such as similarity search, playlist generation, or song synthesis [4]-[9].…”

Section: Introduction (mentioning)

confidence: 99%

Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features

Zhang, Wang, Cheng et al. 2022

Preprint

Self Cite


Metaverse is an interactive world that combines reality and virtuality, where participants can be virtual avatars. Anyone can hold a concert in a virtual concert hall, and users can quickly identify the real singer behind the virtual idol through singer identification. Most singer identification methods operate on frame-level features. However, besides the singer's timbre, a music frame also carries musical information such as melodiousness, rhythm, and tonality. This musical information acts as noise when frame-level features are used to identify singers. In this paper, instead of using only frame-level features, we propose two additional features that address this problem. The middle-level feature represents the music's melodiousness, rhythmic stability, and tonal stability, and is able to capture the perceptual features of music. The timbre feature, which is used in speaker identification, represents the singer's voice characteristics. Furthermore, we propose a convolutional recurrent neural network (CRNN) to combine the three features for singer identification. The model first fuses the frame-level feature and the timbre feature, and then combines the middle-level features with the fused features. In experiments, the proposed method achieves comparable performance, with an average F1 score of 0.81 on the Artist20 benchmark dataset, which significantly improves on related works.


“…The task of singing voice synthesis is similar to the text-to-speech (TTS) in speech processing, and the synthesis speech is generated according to the given text. With the development of text-to-speech technology, many technologies [1]-[7] have been successfully applied to the task of singing voice synthesis. Both of the tasks of TTS and SVS encoded the lyrics or text into an acoustic variable, through a vocoder to synthesize the audio waveform.…”

Section: Introduction (mentioning)

confidence: 99%

SUSing: SU-net for Singing Voice Synthesis

Zhang, Wang, Cheng et al. 2022

Preprint

Self Cite


Singing voice synthesis is a generative task that involves multi-dimensional control of the singing model, including lyrics, pitch, and duration, and includes the timbre of the singer and singing skills such as vibrato. In this paper, we proposed SU-net for singing voice synthesis named SUSing. Synthesizing singing voice is treated as a translation task between lyrics and music score and spectrum. The lyrics and music score information is encoded into a two-dimensional feature representation through the convolution layer. The two-dimensional feature and its frequency spectrum are mapped to the target spectrum in an autoregressive manner through a SU-net network. Within the SU-net the stripe pooling method is used to replace the alternate global pooling method to learn the vertical frequency relationship in the spectrum and the changes of frequency in the time domain. The experimental results on the public dataset Kiritan show that the proposed method can synthesize more natural singing voices.

