2023
DOI: 10.48550/arxiv.2302.09928
Preprint
An ASR-free Fluency Scoring Approach with Self-Supervised Learning

Abstract: A typical fluency scoring system generally relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech, either for the subsequent calculation of fluency-related features or for directly modeling speech fluency with an end-to-end approach. This paper describes a novel ASR-free approach for automatic fluency assessment using self-supervised learning (SSL). Specifically, wav2vec2.0 is used to extract frame-level speech features, followed by K-means clustering to assign a pseudo label (…
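The pseudo-labeling step the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the wav2vec2.0 frame features are simulated here with random arrays (real features would come from a pre-trained model), and the K-means routine is a plain NumPy version used for clarity.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, n_iters=50, seed=0):
    """Assign each frame-level feature vector a cluster id (pseudo label)
    via plain K-means. `features` has shape (n_frames, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance of every frame to every centroid: shape (n_frames, k).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Stand-in for wav2vec2.0 frame features: 200 frames, 768-dim (simulated data).
frames = np.random.default_rng(1).normal(size=(200, 768))
pseudo = kmeans_pseudo_labels(frames, k=8)
print(pseudo.shape)  # one pseudo label per frame
```

In the paper's pipeline, these per-frame cluster ids would replace ASR-derived time stamps as the discrete units fed to the downstream fluency scorer.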

Cited by 1 publication (6 citation statements)
References 21 publications (35 reference statements)
“…Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]) and prosodic features (e.g., energy [9], pitch [7] and phone duration [10]). Bi-directional Long Short-Term Memory (BLSTM) [7,10,11,15] and Transformer models [8,9] have been used to capture the dynamic changes of phone-level pronunciation-related features for better modeling the evolution of local fluency over time.…”
Section: Related Work
confidence: 99%
“…More recently, self-supervised learning (SSL)-based speech models such as wav2vec2 [25] have been shown to be effective in learning meaningful representations from raw speech signals in various downstream tasks [26]. Inspired by this success, researchers used pre-trained SSL models like wav2vec2 [25], HuBERT [27], and WavLM [28] to extract features directly and feed them into fluency scorers [9,11,15]. Due to the promising performance, we consider the two SSL-based models [9,15] as strong baselines of this work.…”
Section: Related Work
confidence: 99%