Interspeech 2018
DOI: 10.21437/interspeech.2018-1267

Automatic Pronunciation Evaluation of Singing

Abstract: In this work, we develop a strategy to automatically evaluate the pronunciation of singing. We apply a singing-adapted automatic speech recognizer (ASR) in a two-stage approach for evaluating the pronunciation of singing. First, we force-align the lyrics with the sung utterances to obtain the word boundaries. We improve the word boundaries with a novel lexical modification technique. Second, we investigate the performance of phonetic posteriorgram (PPG) based template-independent and template-dependent methods for scoring the …
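The PPG-based scoring is only summarized above; as a minimal sketch, a template-dependent comparison could align a reference posteriorgram (from a well-pronounced rendition) against a test posteriorgram with DTW and use the length-normalized path cost as the pronunciation score. The distance measure, normalization, and array shapes below are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def dtw_cost(ref, test):
    """ref, test: (frames, num_phonemes) phonetic posteriorgrams (PPGs)."""
    n, m = len(ref), len(test)
    # Frame-wise Euclidean distance between posterior vectors (an assumption;
    # other divergences could be used instead).
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalized alignment cost

# Toy usage: lower cost means the test PPG is closer to the reference.
ref = np.random.dirichlet(np.ones(40), size=50)   # 50 frames, 40 phoneme classes
test = np.random.dirichlet(np.ones(40), size=60)
print(dtw_cost(ref, test))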

Cited by 17 publications (20 citation statements) · References 28 publications

“…The baseline acoustic model is trained using 40-dimensional MFCCs as acoustic features, combined with i-vectors [37]. During training of the neural network [38], the frame subsampling rate is set to 3, giving an effective frame shift of 30 ms. A duration-based modified pronunciation lexicon is employed, as detailed in [10].…”
Section: ASR Architecture
confidence: 99%
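The quoted setup implies a simple input assembly: 40-dimensional MFCC frames subsampled by a factor of 3 (an effective 30 ms shift, assuming the usual 10 ms base frame shift) and concatenated with one i-vector per utterance. A minimal sketch of that assembly follows; the dimensions and names are illustrative rather than taken from the cited recipe.

import numpy as np

FRAME_SHIFT_MS = 10          # typical MFCC analysis shift (assumed)
SUBSAMPLING_FACTOR = 3       # frame subsampling rate from the quote
EFFECTIVE_SHIFT_MS = FRAME_SHIFT_MS * SUBSAMPLING_FACTOR   # 30 ms

def assemble_inputs(mfcc, ivector):
    """Append a fixed per-utterance i-vector to every subsampled MFCC frame.

    mfcc:    (num_frames, 40) array of MFCC features
    ivector: (ivector_dim,)   array, e.g. 100-dimensional
    """
    subsampled = mfcc[::SUBSAMPLING_FACTOR]             # keep every third frame
    tiled = np.tile(ivector, (subsampled.shape[0], 1))  # repeat i-vector per frame
    return np.concatenate([subsampled, tiled], axis=1)

# Toy usage: 300 MFCC frames plus a 100-dimensional i-vector.
features = assemble_inputs(np.random.randn(300, 40), np.random.randn(100))
print(features.shape, EFFECTIVE_SHIFT_MS)   # (100, 140) 30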
“…In total, 3913 songs are used for training. Acoustic modeling and alignment are done with the open-source speech recognition toolkit Kaldi [21], using a duration-based pronunciation lexicon for singing voice [22]. Performance appears to rely on a very large beam width during Viterbi decoding [23], as noted in previous work [24], which is computationally expensive.…”
Section: A. Lyrics Alignment
confidence: 99%
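The duration-based pronunciation lexicon for singing is referenced but not spelled out in the quote. As a purely hypothetical sketch of the general idea, each vowel in a lexicon entry could be prolonged so that forced alignment can absorb long sung vowels; the vowel set, repeat count, and function name below are assumptions, not the technique from [22] or [10].

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def extend_vowels(pron, repeats=2):
    """Return a pronunciation variant with each vowel phone repeated `repeats` times."""
    out = []
    for phone in pron:
        base = phone.rstrip("012")   # strip ARPAbet stress markers, e.g. 'AY1' -> 'AY'
        out.extend([phone] * (repeats if base in VOWELS else 1))
    return out

# "night" /N AY1 T/ -> variant that can cover a prolonged sung vowel.
print(extend_vowels(["N", "AY1", "T"]))   # ['N', 'AY1', 'AY1', 'T']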
“…Some work has taken advantage of the characteristics of music itself: Gupta, Li, and Wang (2018) extended the length of pronounced vowels in the output sequences by increasing the probability that a frame following a vowel frame carries the same phoneme, because sung vowels are often prolonged. Kruspe and Fraunhofer (2016) boosted the ALT system by using a newly generated alignment (Mesaros and Virtanen 2008) between singing and lyrics.…”
Section: Automatic Lyrics Transcription
confidence: 99%
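The vowel-duration boosting attributed to Gupta, Li, and Wang (2018) can be pictured as a small bias during frame-wise decoding: if the previous frame was decoded as a vowel, the posterior of that same phoneme is nudged upward before the current frame's decision. The greedy decoder, phoneme indices, and bonus value below are illustrative assumptions, not the authors' exact method.

import numpy as np

def decode_with_vowel_boost(posteriors, vowel_ids, bonus=0.1):
    """Greedy frame-wise decoding with a self-continuation bonus for vowels.

    posteriors: (num_frames, num_phonemes) per-frame phoneme posteriors
    vowel_ids:  set of phoneme indices treated as vowels
    """
    path, prev = [], None
    for frame in posteriors:
        scores = frame.copy()
        if prev is not None and prev in vowel_ids:
            scores[prev] += bonus    # encourage the vowel to continue
        prev = int(np.argmax(scores))
        path.append(prev)
    return path

# Toy example: 5 frames over 3 phonemes, phoneme 1 is a vowel. Without the
# bonus, the second frame would flip away from the vowel.
post = np.array([[0.10, 0.80, 0.10],
                 [0.45, 0.40, 0.15],
                 [0.20, 0.60, 0.20],
                 [0.70, 0.20, 0.10],
                 [0.10, 0.10, 0.80]])
print(decode_with_vowel_boost(post, vowel_ids={1}))   # [1, 1, 1, 0, 2]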