Interspeech 2022
DOI: 10.21437/interspeech.2022-10245
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Cited by 15 publications (5 citation statements)
References 0 publications
“…Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]) and prosodic features (e.g., energy [9], pitch [7] and phone duration [10]). Bi-directional Long Short-Term Memory (BLSTM) [7,10,11,15] and Transformer models [8,9] have been used to capture the dynamic changes of phone-level pronunciation-related features for better modeling the evolution of local fluency over time.…”
Section: Related Work
confidence: 99%
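The statement above describes a common design: a BLSTM reads a sequence of phone-level features and is pooled into a single utterance-level fluency score. A minimal sketch of that pattern follows; the layer sizes, feature layout, and mean-pooling choice are illustrative assumptions, not any cited paper's exact model.

```python
# Minimal sketch of a BLSTM fluency scorer over phone-level features.
# Dimensions and pooling are assumptions for illustration only.
import torch
import torch.nn as nn

class BlstmFluencyScorer(nn.Module):
    def __init__(self, feat_dim=4, hidden=32):
        super().__init__()
        # BLSTM models the temporal dynamics of the phone sequence
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        # Linear head maps the pooled representation to one scalar score
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, phone_feats):  # (batch, num_phones, feat_dim)
        out, _ = self.blstm(phone_feats)
        pooled = out.mean(dim=1)     # average over the phone sequence
        return self.head(pooled).squeeze(-1)  # utterance-level score

scorer = BlstmFluencyScorer()
# e.g. one hypothetical row per phone: [energy, pitch, duration, extra]
feats = torch.randn(2, 10, 4)
scores = scorer(feats)
print(scores.shape)  # torch.Size([2]) — one score per utterance
```

Mean pooling is the simplest way to collapse the per-phone outputs; attention pooling or taking the final hidden states are equally common alternatives.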
“…More recently, self-supervised learning (SSL)-based speech models such as wav2vec2 [25] have been shown to be effective in learning meaningful representations from raw speech signals in various downstream tasks [26]. Inspired by this success, researchers used pre-trained SSL models like wav2vec2 [25], HuBERT [27], and WavLM [28] to extract features directly and feed them into fluency scorers [9,11,15]. Due to the promising performance, we consider the two SSL-based models [9,15] as strong baselines of this work.…”
Section: Related Work
confidence: 99%
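The pipeline described above is: a frozen pre-trained SSL encoder turns raw waveform into frame-level representations, which are pooled and fed to a small fluency scorer. The sketch below shows that wiring; since loading a real wav2vec2/HuBERT/WavLM checkpoint is out of scope here, a frozen strided convolution stands in for the SSL encoder (a hypothetical stand-in, not the actual models).

```python
# Sketch of the SSL-features-to-fluency-scorer pipeline. A frozen Conv1d
# is a hypothetical stand-in for a pre-trained wav2vec2/HuBERT/WavLM encoder.
import torch
import torch.nn as nn

class FrozenEncoderStub(nn.Module):
    """Stand-in for a frozen pre-trained SSL encoder."""
    def __init__(self, dim=16):
        super().__init__()
        # ~20 ms stride at 16 kHz, loosely mimicking wav2vec2's frame rate
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)
        for p in self.parameters():
            p.requires_grad = False  # encoder stays frozen

    def forward(self, wav):          # (batch, samples)
        # -> (batch, frames, dim): frame-level representations
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)

encoder = FrozenEncoderStub()
# Small trainable head that predicts one fluency score per utterance
scorer = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

wav = torch.randn(1, 16000)          # 1 s of 16 kHz audio
frames = encoder(wav)                # extract frame-level features
score = scorer(frames.mean(dim=1))   # pool over time, then score
print(score.shape)  # torch.Size([1, 1])
```

In the cited systems only the scorer (and sometimes the upper encoder layers) is trained on fluency labels, which is why the frozen-encoder split is the relevant part of this sketch.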
“…Self-supervised learning (SSL) has recently shown promising results in speech processing applications [9,10,11,12,13,14]. SSL can learn rich speech representations without transcription labels by training on massive unlabeled audio data.…”
Section: Introduction
confidence: 99%