Interspeech 2021
DOI: 10.21437/interspeech.2021-1258

Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil

Cited by 10 publications
(10 citation statements)
“…Over the past few decades, extensive research has been conducted on spoken fluency scoring. Traditionally, handcrafted features such as the statistics of speech break [6], speech rate [6,7,12,13,14], filled pause, and goodness of pronunciation (GOP) [7][8][9] were collected based on phone boundaries and fed into various fluency scorers such as SVM [12,14], and multiple linear [6]. Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]), prosodic features (e.g., energy [9], pitch [7] and phone duration [10]).…”
Section: Related Work (mentioning)
confidence: 99%
“…Traditionally, handcrafted features such as the statistics of speech break [6], speech rate [6,7,12,13,14], filled pause, and goodness of pronunciation (GOP) [7][8][9] were collected based on phone boundaries and fed into various fluency scorers such as SVM [12,14], and multiple linear [6]. Recent works have employed sequence models to directly learn utterance-level fluency scores from phone-level raw features, including phonetic features (e.g., phone sequence [7][8][9][10]), prosodic features (e.g., energy [9], pitch [7] and phone duration [10]). Bi-directional Long Short Term Memory (BLSTM) [7,10,11,15] and Transformer models [8,9] have been used to capture the dynamic changes of phone-level pronunciation-related features for better modeling the evolution of local fluency over time.…”
Section: Related Work (mentioning)
confidence: 99%
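
The handcrafted features named in the statement above (speech rate and speech-break statistics derived from phone boundaries) can be illustrated with a minimal Python sketch. It is not taken from any of the cited works. The `Phone` record, the `"sil"` silence label, the `min_pause` threshold, and the function name `fluency_features` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Phone:
    label: str    # phone symbol; "sil" marks silence (assumed convention)
    start: float  # start time in seconds
    end: float    # end time in seconds


def fluency_features(phones: List[Phone], min_pause: float = 0.2) -> Dict[str, float]:
    """Compute simple fluency features from a phone-level alignment.

    min_pause is an assumed threshold: silences shorter than this
    are not counted as speech breaks.
    """
    if not phones:
        raise ValueError("empty alignment")
    total = phones[-1].end - phones[0].start
    spoken = [p for p in phones if p.label != "sil"]
    pauses = [
        p.end - p.start
        for p in phones
        if p.label == "sil" and p.end - p.start >= min_pause
    ]
    return {
        # phones per second over the whole utterance, pauses included
        "speech_rate": len(spoken) / total if total > 0 else 0.0,
        "pause_count": float(len(pauses)),
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }


if __name__ == "__main__":
    alignment = [
        Phone("h", 0.0, 0.25),
        Phone("sil", 0.25, 0.75),
        Phone("i", 0.75, 1.0),
    ]
    print(fluency_features(alignment))
```

A feature vector of this kind is what a conventional scorer such as an SVM would take as input; the sequence models mentioned in the statement instead consume the phone-level features directly.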
“…The word pronunciation score in each of them has been labeled by five experts, the median scores were adopted following the score files coming with the database, ranging from 0 to 10. Following the calculation method of previous study [30], the averaged inter-rater agreement is 0.726. In addition, linguistic experts in Bytedance collected a small amount of task-related unlabeled data (e.g., a group of Chinese adults are required to read aloud given English prompts).…”
Section: Experimental Setup, 3.1 Speech Corpus (mentioning)
confidence: 99%
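
The statement above takes the median of five expert scores per word and reports an averaged inter-rater agreement. The calculation method of [30] is not given in the excerpt, so the Python sketch below assumes one common choice, the mean pairwise Pearson correlation between raters. The function names and the `ratings[r][i]` layout (score from rater `r` for word `i`) are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined if either rater gives constant scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def median_scores(ratings: List[Sequence[float]]) -> List[float]:
    """Per-word median across raters, used as the reference label."""
    return [median(column) for column in zip(*ratings)]


def mean_pairwise_agreement(ratings: List[Sequence[float]]) -> float:
    """Average Pearson correlation over all rater pairs (assumed metric)."""
    return mean(pearson(a, b) for a, b in combinations(ratings, 2))


if __name__ == "__main__":
    # Toy scores on the 0-10 scale from three raters over four words.
    ratings = [[8, 5, 9, 3], [7, 6, 9, 2], [8, 4, 10, 3]]
    print(median_scores(ratings))
    print(mean_pairwise_agreement(ratings))
```

Other agreement measures, such as Cohen's kappa or intraclass correlation, would give different numbers, so the 0.726 figure should only be compared against values computed with the same method.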