Interspeech 2018
DOI: 10.21437/interspeech.2018-1267

Automatic Pronunciation Evaluation of Singing

Abstract: In this work, we develop a strategy to automatically evaluate the pronunciation of singing. We apply a singing-adapted automatic speech recognizer (ASR) in a two-stage approach for evaluating the pronunciation of singing. First, we force-align the lyrics with the sung utterances to obtain the word boundaries. We improve the word boundaries with a novel lexical modification technique. Second, we investigate the performance of phonetic posteriorgram (PPG) based template-independent and template-dependent methods for scoring the …
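The PPG-based scoring is only summarized above; as a minimal sketch, a template-dependent comparison could align a reference posteriorgram (from a well-pronounced rendition) against a test posteriorgram with DTW and use the length-normalized path cost as the pronunciation score. The distance measure, normalization, and array shapes below are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def dtw_cost(ref, test):
    """ref, test: (frames, num_phonemes) phonetic posteriorgrams (PPGs)."""
    n, m = len(ref), len(test)
    # Frame-wise Euclidean distance between posterior vectors (an assumption;
    # other divergences could be used instead).
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalized alignment cost

# Toy usage: lower cost means the test PPG is closer to the reference.
ref = np.random.dirichlet(np.ones(40), size=50)   # 50 frames, 40 phoneme classes
test = np.random.dirichlet(np.ones(40), size=60)
print(dtw_cost(ref, test))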

Cited by 17 publications (20 citation statements) · References 28 publications

“…The baseline acoustic model is trained using 40-dimensional MFCCs as acoustic features, combined with i-vectors [37]. During training of the neural network [38], the frame subsampling rate is set to 3, giving an effective frame shift of 30 ms. A duration-based modified pronunciation lexicon is employed, as detailed in [10].…”
Section: ASR Architecture
confidence: 99%
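The quoted setup implies a simple input assembly: 40-dimensional MFCC frames subsampled by a factor of 3 (an effective 30 ms shift, assuming the usual 10 ms base frame shift) and concatenated with one i-vector per utterance. A minimal sketch of that assembly follows; the dimensions and names are illustrative rather than taken from the cited recipe.

import numpy as np

FRAME_SHIFT_MS = 10          # typical MFCC analysis shift (assumed)
SUBSAMPLING_FACTOR = 3       # frame subsampling rate from the quote
EFFECTIVE_SHIFT_MS = FRAME_SHIFT_MS * SUBSAMPLING_FACTOR   # 30 ms

def assemble_inputs(mfcc, ivector):
    """Append a fixed per-utterance i-vector to every subsampled MFCC frame.

    mfcc:    (num_frames, 40) array of MFCC features
    ivector: (ivector_dim,)   array, e.g. 100-dimensional
    """
    subsampled = mfcc[::SUBSAMPLING_FACTOR]             # keep every third frame
    tiled = np.tile(ivector, (subsampled.shape[0], 1))  # repeat i-vector per frame
    return np.concatenate([subsampled, tiled], axis=1)

# Toy usage: 300 MFCC frames plus a 100-dimensional i-vector.
features = assemble_inputs(np.random.randn(300, 40), np.random.randn(100))
print(features.shape, EFFECTIVE_SHIFT_MS)   # (100, 140) 30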
“…In total, 3913 songs are used for training. Acoustic modeling and alignment are done with the open-source speech recognition toolkit Kaldi [21], using a duration-based pronunciation lexicon for singing voice [22]. Performance appears to rely on a very large beam width during Viterbi decoding [23], as noted in previous work [24], which is computationally expensive.…”
Section: A. Lyrics Alignment
confidence: 99%
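The duration-based pronunciation lexicon for singing is referenced but not spelled out in the quote. As a purely hypothetical sketch of the general idea, each vowel in a lexicon entry could be prolonged so that forced alignment can absorb long sung vowels; the vowel set, repeat count, and function name below are assumptions, not the technique from [22] or [10].

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def extend_vowels(pron, repeats=2):
    """Return a pronunciation variant with each vowel phone repeated `repeats` times."""
    out = []
    for phone in pron:
        base = phone.rstrip("012")   # strip ARPAbet stress markers, e.g. 'AY1' -> 'AY'
        out.extend([phone] * (repeats if base in VOWELS else 1))
    return out

# "night" /N AY1 T/ -> variant that can cover a prolonged sung vowel.
print(extend_vowels(["N", "AY1", "T"]))   # ['N', 'AY1', 'AY1', 'T']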
“…Some work has taken advantage of the characteristics of music itself: Gupta, Li, and Wang (2018) extended the length of pronounced vowels in the output sequences by increasing the probability that a frame following a vowel frame carries the same phoneme, because sung vowels are often prolonged. Kruspe and Fraunhofer (2016) boosted the ALT system by using a newly generated alignment (Mesaros and Virtanen 2008) between singing and lyrics.…”
Section: Automatic Lyrics Transcription
confidence: 99%
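The vowel-duration boosting attributed to Gupta, Li, and Wang (2018) can be pictured as a small bias during frame-wise decoding: if the previous frame was decoded as a vowel, the posterior of that same phoneme is nudged upward before the current frame's decision. The greedy decoder, phoneme indices, and bonus value below are illustrative assumptions, not the authors' exact method.

import numpy as np

def decode_with_vowel_boost(posteriors, vowel_ids, bonus=0.1):
    """Greedy frame-wise decoding with a self-continuation bonus for vowels.

    posteriors: (num_frames, num_phonemes) per-frame phoneme posteriors
    vowel_ids:  set of phoneme indices treated as vowels
    """
    path, prev = [], None
    for frame in posteriors:
        scores = frame.copy()
        if prev is not None and prev in vowel_ids:
            scores[prev] += bonus    # encourage the vowel to continue
        prev = int(np.argmax(scores))
        path.append(prev)
    return path

# Toy example: 5 frames over 3 phonemes, phoneme 1 is a vowel. Without the
# bonus, the second frame would flip away from the vowel.
post = np.array([[0.10, 0.80, 0.10],
                 [0.45, 0.40, 0.15],
                 [0.20, 0.60, 0.20],
                 [0.70, 0.20, 0.10],
                 [0.10, 0.10, 0.80]])
print(decode_with_vowel_boost(post, vowel_ids={1}))   # [1, 1, 1, 0, 2]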