Interspeech 2022
DOI: 10.21437/interspeech.2022-10245
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Cited by 15 publications (5 citation statements)
References 0 publications
“…Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]) and prosodic features (e.g., energy [9], pitch [7] and phone duration [10]). Bi-directional Long Short-Term Memory (BLSTM) [7,10,11,15] and Transformer models [8,9] have been used to capture the dynamic changes of phone-level pronunciation-related features for better modeling the evolution of local fluency over time.…”
Section: Related Work
confidence: 99%
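The statement above describes a common design: a BLSTM reads a sequence of phone-level features and is pooled into a single utterance-level fluency score. A minimal sketch of that pattern follows; the layer sizes, feature layout, and mean-pooling choice are illustrative assumptions, not any cited paper's exact model.

```python
# Minimal sketch of a BLSTM fluency scorer over phone-level features.
# Dimensions and pooling are assumptions for illustration only.
import torch
import torch.nn as nn

class BlstmFluencyScorer(nn.Module):
    def __init__(self, feat_dim=4, hidden=32):
        super().__init__()
        # BLSTM models the temporal dynamics of the phone sequence
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        # Linear head maps the pooled representation to one scalar score
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, phone_feats):  # (batch, num_phones, feat_dim)
        out, _ = self.blstm(phone_feats)
        pooled = out.mean(dim=1)     # average over the phone sequence
        return self.head(pooled).squeeze(-1)  # utterance-level score

scorer = BlstmFluencyScorer()
# e.g. one hypothetical row per phone: [energy, pitch, duration, extra]
feats = torch.randn(2, 10, 4)
scores = scorer(feats)
print(scores.shape)  # torch.Size([2]) — one score per utterance
```

Mean pooling is the simplest way to collapse the per-phone outputs; attention pooling or taking the final hidden states are equally common alternatives.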
“…More recently, self-supervised learning (SSL)-based speech models such as wav2vec2 [25] have been shown to be effective in learning meaningful representations from raw speech signals in various downstream tasks [26]. Inspired by this success, researchers used pre-trained SSL models like wav2vec2 [25], HuBERT [27], and WavLM [28] to extract features directly and feed them into fluency scorers [9,11,15]. Due to the promising performance, we consider the two SSL-based models [9,15] as strong baselines of this work.…”
Section: Related Work
confidence: 99%
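The pipeline described above is: a frozen pre-trained SSL encoder turns raw waveform into frame-level representations, which are pooled and fed to a small fluency scorer. The sketch below shows that wiring; since loading a real wav2vec2/HuBERT/WavLM checkpoint is out of scope here, a frozen strided convolution stands in for the SSL encoder (a hypothetical stand-in, not the actual models).

```python
# Sketch of the SSL-features-to-fluency-scorer pipeline. A frozen Conv1d
# is a hypothetical stand-in for a pre-trained wav2vec2/HuBERT/WavLM encoder.
import torch
import torch.nn as nn

class FrozenEncoderStub(nn.Module):
    """Stand-in for a frozen pre-trained SSL encoder."""
    def __init__(self, dim=16):
        super().__init__()
        # ~20 ms stride at 16 kHz, loosely mimicking wav2vec2's frame rate
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)
        for p in self.parameters():
            p.requires_grad = False  # encoder stays frozen

    def forward(self, wav):          # (batch, samples)
        # -> (batch, frames, dim): frame-level representations
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)

encoder = FrozenEncoderStub()
# Small trainable head that predicts one fluency score per utterance
scorer = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

wav = torch.randn(1, 16000)          # 1 s of 16 kHz audio
frames = encoder(wav)                # extract frame-level features
score = scorer(frames.mean(dim=1))   # pool over time, then score
print(score.shape)  # torch.Size([1, 1])
```

In the cited systems only the scorer (and sometimes the upper encoder layers) is trained on fluency labels, which is why the frozen-encoder split is the relevant part of this sketch.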
“…Self-supervised learning (SSL) has recently shown promising results in speech processing applications [9,10,11,12,13,14]. SSL can learn rich speech representations without transcription labels by training on massive unlabeled audio data.…”
Section: Introduction
confidence: 99%