Interspeech 2018
DOI: 10.21437/interspeech.2018-1110

L2-ARCTIC: A Non-native English Speech Corpus

Abstract: In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, each L1 containing recordings from one male and one female speaker. Each speaker recorded approximately one hour of read speech from the Carnegie Mellon University ARCTIC prompt…

Cited by 86 publications (40 citation statements). References 31 publications.

“…The AM has five hidden layers and an output layer with 5816 senones. We trained the PPG-to-Mel and WaveGlow models on two non-native speakers, YKWK (native male Korean speaker) and ZHAA (native female Arabic speaker) from the publicly-available L2-ARCTIC corpus [34]. We applied noise reduction on the original L2-ARCTIC recordings using Audacity [36] to remove ambient background noise.…”
Section: Methods
confidence: 99%
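
For concreteness, here is a minimal PyTorch sketch of an acoustic model matching the quoted description: five hidden layers feeding a 5816-senone softmax whose per-frame posteriors form the PPG. The hidden width (1024) and the 40-dimensional filterbank input are assumptions, not details from the cited paper.

import torch
import torch.nn as nn

class SenoneAcousticModel(nn.Module):
    """Five hidden layers + 5816-senone output, per the quote above."""
    def __init__(self, input_dim=40, hidden_dim=1024, num_senones=5816):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(5):                      # five hidden layers
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(hidden_dim, num_senones)

    def forward(self, feats):
        # feats: (batch, frames, input_dim) acoustic features
        logits = self.out(self.hidden(feats))
        return torch.softmax(logits, dim=-1)    # per-frame senone PPG

am = SenoneAcousticModel()
ppg = am(torch.randn(1, 300, 40))               # -> (1, 300, 5816)
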
“…The original Tacotron 2 was designed to accept character sequences as input, which are significantly shorter than our PPG sequences. For example, each sentence in our speech corpus [34] contains an average of 41 characters, whereas the PPG sequence has a few hundred frames. Therefore, the original Tacotron 2 attention mechanism would be confused by such long input sequences and cause misalignment between the PPG and acoustic sequences, as pointed out in [15].…”
Section: PPG-to-Mel-Spectrogram Conversion
confidence: 99%
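
A back-of-envelope calculation illustrates the length mismatch the quote describes. The 12.5 ms frame hop is Tacotron 2's standard setting and the 41-character average comes from the quote; the 4-second utterance duration is an assumed figure for illustration.

utterance_seconds = 4.0      # assumed typical sentence duration
hop_seconds = 0.0125         # 12.5 ms hop, Tacotron 2's usual setting
ppg_frames = int(utterance_seconds / hop_seconds)
avg_chars = 41               # average characters per sentence (quote)
print(f"{ppg_frames} PPG frames vs. {avg_chars} characters "
      f"(~{ppg_frames / avg_chars:.0f}x longer)")   # 320 vs. 41, ~8x
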
“…Since the pre-trained audio transformer models are used in a variety of domains, we tested them on: the LibriSpeech dataset, which comprises read speech by native English speakers (Panayotov et al, 2015); spontaneous native English speech, using the Mozilla Common Voice dataset (Ardila et al, 2019); and L2 English speech (speakers with English as their second language), using L2-Arctic (Zhao et al, 2018).…”
Section: Problem Definition and Data
confidence: 99%
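
A sketch of how such a three-way evaluation might be organized. The directory paths and the commented transcribe/score steps are placeholders, not the cited authors' code or any real API.

from pathlib import Path

# Hypothetical local dataset roots; adjust to your own layout.
TEST_SETS = {
    "LibriSpeech (read, native)":         Path("data/librispeech/test-clean"),
    "Common Voice (spontaneous, native)": Path("data/common_voice/en"),
    "L2-ARCTIC (L2 English)":             Path("data/l2arctic"),
}

for name, root in TEST_SETS.items():
    # File extensions vary by corpus (LibriSpeech ships FLAC).
    audio = [p for ext in ("*.wav", "*.flac") for p in root.rglob(ext)]
    print(f"{name}: {len(audio)} utterances")
    # for clip in audio:
    #     hyp = transcribe(clip)            # placeholder ASR call
    #     score(reference_for(clip), hyp)   # placeholder WER scoring
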
“…The AM had five hidden layers and a final softmax layer that produced 5816-dimensional PPGs. We trained the PPG-to-Mel and WaveGlow models on two L2 speakers from the publicly-available L2-ARCTIC corpus [7]: ABA (male Arabic speaker) and EBVS (male Spanish speaker), and two L1 speakers from the ARCTIC corpus [8]: BDL (male American English) and RMS (male American English). Each speaker in L2-ARCTIC and ARCTIC recorded the same set of 1,132 sentences, or about an hour of speech.…”
Section: Speech Corpus
confidence: 99%
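
The quoted numbers imply roughly 3.2 seconds per utterance, a useful sanity check when preparing the data. The sketch below just encodes the speaker inventory and that arithmetic; all figures come from the quote, nothing else is assumed.

# Speaker inventory as described in the quote.
L2_SPEAKERS = {"ABA": "Arabic (male)",
               "EBVS": "Spanish (male)"}           # L2-ARCTIC [7]
L1_SPEAKERS = {"BDL": "American English (male)",
               "RMS": "American English (male)"}   # ARCTIC [8]

SENTENCES = 1132        # same prompt set for every speaker
TOTAL_SECONDS = 3600    # "about an hour of speech" per speaker

print(f"~{TOTAL_SECONDS / SENTENCES:.1f} s per utterance")  # ~3.2 s
for spk, desc in {**L2_SPEAKERS, **L1_SPEAKERS}.items():
    print(spk, "-", desc)
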
“…We performed accent-conversion experiments on pairs of L2 and L1 speakers from the L2-ARCTIC [7] and ARCTIC corpora [8], respectively. For each speaker pair, we generated accent conversions in both directions: L2 speaker with L1 accent, and L1 speaker with L2 accent.…”
Section: Introduction
confidence: 99%
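
A small sketch of enumerating both conversion directions per speaker pair, as the quote describes. Whether every L2 speaker was paired with every L1 speaker is an assumption; the full cross product is shown only for illustration.

from itertools import product

l2_speakers = ["ABA", "EBVS"]   # L2-ARCTIC [7]
l1_speakers = ["BDL", "RMS"]    # ARCTIC [8]

# Cross product shown for illustration; the actual pairing used in the
# cited experiments may have been a subset.
for l2, l1 in product(l2_speakers, l1_speakers):
    print(f"{l2} voice -> {l1} accent (L2 speaker, native accent)")
    print(f"{l1} voice -> {l2} accent (L1 speaker, non-native accent)")
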