2018
DOI: 10.1016/j.csl.2017.11.001

Exploiting automatic speech recognition errors to enhance partial and synchronized caption for facilitating second language listening

Cited by 16 publications (7 citation statements)
References 30 publications
“…The deep learning models used were inspired by state-of-the-art automatic speech recognition (ASR) networks (Amodei et al, 2015). ASR systems without language models are error prone when detecting the canonical structure of resyllabified sequences (Adda-Decker et al, 2002;Mirzaei et al, 2018;Wu et al, 1997). For example, a sequence like "fade out" could be recognised as "Fay doubt" if the coda /d/ is resyllabified as the onset of the second syllable.…”
Section: B Using Deep Neural Network With Acoustic Data To Identify R...mentioning
confidence: 99%
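The breached-boundary confusion quoted above can be made concrete: when two different phrases concatenate to the same phoneme stream, an ASR system without a language model cannot distinguish them from the acoustics alone. A minimal sketch — the mini-lexicon and ARPAbet-style transcriptions are illustrative, not taken from the cited work:

```python
# Toy illustration of a "breached boundary": resyllabification makes two
# different phrases map to the same phoneme stream, so an ASR system with
# no language model cannot tell them apart from the audio alone.
# The mini-lexicon uses ARPAbet-style symbols and is purely illustrative.

LEXICON = {
    "fade":  ["F", "EY", "D"],
    "out":   ["AW", "T"],
    "fay":   ["F", "EY"],
    "doubt": ["D", "AW", "T"],
}

def phoneme_stream(phrase, lexicon=LEXICON):
    """Concatenate per-word pronunciations into one phoneme sequence."""
    return tuple(p for word in phrase.lower().split() for p in lexicon[word])

# Both phrases yield (F, EY, D, AW, T): the coda /d/ of "fade" is
# reanalysed as the onset of "doubt" -- the confusion described above.
print(phoneme_stream("fade out") == phoneme_stream("fay doubt"))  # True
```

A real system would compare candidate hypotheses against a full pronunciation lexicon such as the CMU Pronouncing Dictionary rather than a four-word toy table.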
“…In another direction, it is worth highlighting the work of Mirzaei et al. (2018), which takes the output of automatic speech recognition and analyses the errors committed in order to estimate the difficulties in L2 speech. Among the most common types of errors, the authors identify homophones, minimal pairs, negatives and breached boundaries.…”
Section: Phonetic Assessmentmentioning
confidence: 99%
“…For the baseline version, we used rule-based coarse-grained level assignments to roughly categorize learners into three proficiency levels (beginner, intermediate, advanced) based on learners' assessment tests (TOEFL/TOEIC score, speech rate tolerance, vocabulary size). Word selection is determined by defining thresholds for specific features, including word frequency and speech rate, while also incorporating additional factors such as automatic speech recognition errors, word specificity, proper names, and abbreviations (Mirzaei et al., 2018).…”
Section: Personalized Captionmentioning
confidence: 99%
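The threshold-based selection described in the excerpt can be sketched as a simple rule: a word is shown in the partial caption when it is rare or spoken too fast for the learner's level, or when it falls into an always-show category (likely ASR errors, proper names, abbreviations). The cutoff numbers below are hypothetical placeholders, not the thresholds used by Mirzaei et al. (2018):

```python
# Hedged sketch of threshold-based word selection for a partial caption.
# Cutoff values are hypothetical placeholders; the always-show categories
# (ASR errors, proper names, abbreviations) follow the cited description.

FREQ_RANK_CUTOFF = {"beginner": 2000, "intermediate": 5000, "advanced": 9000}
RATE_TOLERANCE_WPM = {"beginner": 140, "intermediate": 170, "advanced": 200}

def show_in_partial_caption(word_freq_rank, local_speech_rate_wpm, level,
                            is_proper_name=False, is_abbreviation=False,
                            is_asr_error=False):
    """Return True if the word should appear in the partial caption."""
    # Special categories are always captioned, regardless of level.
    if is_proper_name or is_abbreviation or is_asr_error:
        return True
    # Otherwise caption words that are rare for this learner level,
    # or spoken faster than the level's speech-rate tolerance.
    return (word_freq_rank > FREQ_RANK_CUTOFF[level]
            or local_speech_rate_wpm > RATE_TOLERANCE_WPM[level])

print(show_in_partial_caption(3000, 120, "beginner"))  # True: rare word
print(show_in_partial_caption(500, 120, "advanced"))   # False: common, slow
```

In the baseline described above, the per-level thresholds would be derived from the assessment tests (TOEFL/TOEIC score, speech-rate tolerance, vocabulary size) rather than hard-coded.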