2023
DOI: 10.48550/arxiv.2302.09928
Preprint
An ASR-free Fluency Scoring Approach with Self-Supervised Learning

Abstract: A typical fluency scoring system generally relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech, either for the subsequent calculation of fluency-related features or for directly modeling speech fluency with an end-to-end approach. This paper describes a novel ASR-free approach for automatic fluency assessment using self-supervised learning (SSL). Specifically, wav2vec2.0 is used to extract frame-level speech features, followed by K-means clustering to assign a pseudo label (…
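The pseudo-labeling step the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the wav2vec2.0 frame features are simulated here with random arrays (real features would come from a pre-trained model), and the K-means routine is a plain NumPy version used for clarity.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, n_iters=50, seed=0):
    """Assign each frame-level feature vector a cluster id (pseudo label)
    via plain K-means. `features` has shape (n_frames, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance of every frame to every centroid: shape (n_frames, k).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Stand-in for wav2vec2.0 frame features: 200 frames, 768-dim (simulated data).
frames = np.random.default_rng(1).normal(size=(200, 768))
pseudo = kmeans_pseudo_labels(frames, k=8)
print(pseudo.shape)  # one pseudo label per frame
```

In the paper's pipeline, these per-frame cluster ids would replace ASR-derived time stamps as the discrete units fed to the downstream fluency scorer.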

Cited by 1 publication (6 citation statements)
References 21 publications (35 reference statements)
“…Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]) and prosodic features (e.g., energy [9], pitch [7] and phone duration [10]). Bi-directional Long Short-Term Memory (BLSTM) [7,10,11,15] and Transformer models [8,9] have been used to capture the dynamic changes of phone-level pronunciation-related features for better modeling the evolution of local fluency over time.…”
Section: Related Work
confidence: 99%
“…More recently, self-supervised learning (SSL)-based speech models such as wav2vec2 [25] have been shown to be effective in learning meaningful representations from raw speech signals in various downstream tasks [26]. Inspired by this success, researchers used pre-trained SSL models like wav2vec2 [25], HuBERT [27], and WavLM [28] to extract features directly and feed them into fluency scorers [9,11,15]. Due to the promising performance, we consider the two SSL-based models [9,15] as strong baselines of this work.…”
Section: Related Work
confidence: 99%