Recently, researchers have turned their attention to pronunciation assessment based not on comparison between learners' utterances and native models, but on the comprehensibility of the utterances [1, 2, 3]. In our previous studies [4, 5], native listeners' shadowing was investigated and shown to be effective for predicting the comprehensibility perceived by the listeners (shadowers). In this paper, native listeners' shadowings are viewed as spoken annotations that can represent comprehensibility. In [4, 5], to predict the comprehensibility of a non-native utterance, the goodness of pronunciation (GOP) scores of the corresponding native listeners' shadowings were calculated using a DNN-based ASR front-end. Generally speaking, annotations are prepared manually and, even when some techniques are used for annotation, only stable and reliable techniques should be used. In this paper, a simpler, more stable, and more reliable method for deriving comprehensibility annotations is proposed. After shadowing, native listeners are asked to read aloud the sentence intended by the learner. Reading is the most prepared form of speech, and shadowing is probably the least prepared. Dynamic time warping (DTW) between the two utterances is expected to quantify and predict the comprehensibility, or shadowability, perceived by the shadowers. In experiments, DTW between shadowings and readings shows higher correlation with perceived comprehensibility than the GOP scores of the shadowings.
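To make the comparison between a shadowing and a reading concrete, the following is a minimal sketch of a textbook DTW distance between two sequences of frame-level feature vectors. It is not the authors' implementation: the abstract does not state which acoustic features, local distance, or normalization are used, so the Euclidean local cost, the division by the summed sequence lengths, and the function name `dtw_distance` are all assumptions made for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW distance between two sequences of feature vectors.

    seq_a, seq_b: non-empty lists of equal-dimension numeric vectors
    (e.g. one vector per analysis frame). Euclidean local cost and
    normalization by (len(seq_a) + len(seq_b)) are illustrative assumptions.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: advance in seq_a only, in seq_b only, or in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )

    return cost[n][m] / (n + m)


if __name__ == "__main__":
    # Toy one-dimensional "frames" standing in for real acoustic features.
    reading = [[0.0], [1.0], [2.0]]
    shadowing = [[0.0], [1.0], [1.0], [2.5]]
    print(dtw_distance(shadowing, reading))
```

Under this sketch, a shadowing that closely follows the reading yields a small distance, and a larger distance would be taken as a sign of lower comprehensibility; whether the score is used directly or fed to a predictor is not specified in the abstract.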
Recently, researchers have turned their attention to pronunciation assessment based not on comparison between L2 speech and native models, but on the comprehensibility of L2 speech [1, 2, 3]. In our previous studies [4, 5, 6], native listeners' shadowing of L2 speech was examined, and it was shown that the delay of shadowing and the accuracy of articulation in the shadowed utterances, both of which were calculated acoustically, are strongly influenced by the cognitive load imposed by understanding L2 speech, especially when it is strongly accented. In this paper, aside from acoustic analysis of shadowings, we focus on shadowers' facial microexpressions and examine how they correlate with perceived comprehensibility. Two methods of extracting facial expression features are tested. One is a computer-vision-based method, in which recorded videos of shadowers' facial expressions are analyzed. The other uses a physiological sensor that can detect subtle movements of facial muscles. In experiments, the facial expressions of four shadowers are analyzed, each of whom shadowed approximately 800 L2 utterances. Results show that some of the shadowers' facial expressions are highly correlated with perceived comprehensibility, and that those facial expressions are strongly shadower-dependent. These results indicate that shadowers' facial expressions have high potential for comprehensibility prediction.
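Because the reported correlations are shadower-dependent, one natural way to examine them is to compute the correlation separately for each shadower rather than over pooled data. The sketch below shows this with a plain Pearson coefficient. The record layout `(shadower_id, feature_value, comprehensibility_score)`, the function names, and the toy values are hypothetical; the abstract does not specify which correlation measure or features were used.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def per_shadower_correlation(records):
    """Group records by shadower and correlate feature with score per group.

    records: iterable of (shadower_id, feature_value, comprehensibility_score).
    This layout is an assumption made for illustration.
    """
    grouped = defaultdict(lambda: ([], []))
    for shadower_id, feature, score in records:
        grouped[shadower_id][0].append(feature)
        grouped[shadower_id][1].append(score)
    return {sid: pearson(f, s) for sid, (f, s) in grouped.items()}


if __name__ == "__main__":
    # Synthetic toy values, not experimental data.
    toy = [
        ("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
        ("B", 1.0, 3.0), ("B", 2.0, 2.0), ("B", 3.0, 1.0),
    ]
    print(per_shadower_correlation(toy))
```

In the toy data the same feature correlates positively for one shadower and negatively for the other, which is the kind of pattern that pooling across shadowers would hide.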