Interspeech 2018
DOI: 10.21437/interspeech.2018-1110

L2-ARCTIC: A Non-native English Speech Corpus

Abstract: In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, each L1 containing recordings from one male and one female speaker. Each speaker recorded approximately one hour of read speech from the Carnegie Mellon University ARCTIC prompt…

Cited by 86 publications (40 citation statements). References 31 publications.

“…The AM has five hidden layers and an output layer with 5816 senones. We trained the PPG-to-Mel and WaveGlow models on two non-native speakers, YKWK (native male Korean speaker) and ZHAA (native female Arabic speaker) from the publicly-available L2-ARCTIC corpus [34]. We applied noise reduction on the original L2-ARCTIC recordings using Audacity [36] to remove ambient background noise.…”
Section: Methods
confidence: 99%
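
For concreteness, here is a minimal PyTorch sketch of an acoustic model matching the quoted description: five hidden layers feeding a 5816-senone softmax whose per-frame posteriors form the PPG. The hidden width (1024) and the 40-dimensional filterbank input are assumptions, not details from the cited paper.

import torch
import torch.nn as nn

class SenoneAcousticModel(nn.Module):
    """Five hidden layers + 5816-senone output, per the quote above."""
    def __init__(self, input_dim=40, hidden_dim=1024, num_senones=5816):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(5):                      # five hidden layers
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(hidden_dim, num_senones)

    def forward(self, feats):
        # feats: (batch, frames, input_dim) acoustic features
        logits = self.out(self.hidden(feats))
        return torch.softmax(logits, dim=-1)    # per-frame senone PPG

am = SenoneAcousticModel()
ppg = am(torch.randn(1, 300, 40))               # -> (1, 300, 5816)
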
“…The original Tacotron 2 was designed to accept character sequences as input, which are significantly shorter than our PPG sequences. For example, each sentence in our speech corpus [34] contains an average of 41 characters, whereas the PPG sequence has a few hundred frames. Therefore, the original Tacotron 2 attention mechanism would be confused by such long input sequences and cause misalignment between the PPG and acoustic sequences, as pointed out in [15].…”
Section: PPG-to-Mel-Spectrogram Conversion
confidence: 99%
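
A back-of-envelope calculation illustrates the length mismatch the quote describes. The 12.5 ms frame hop is Tacotron 2's standard setting and the 41-character average comes from the quote; the 4-second utterance duration is an assumed figure for illustration.

utterance_seconds = 4.0      # assumed typical sentence duration
hop_seconds = 0.0125         # 12.5 ms hop, Tacotron 2's usual setting
ppg_frames = int(utterance_seconds / hop_seconds)
avg_chars = 41               # average characters per sentence (quote)
print(f"{ppg_frames} PPG frames vs. {avg_chars} characters "
      f"(~{ppg_frames / avg_chars:.0f}x longer)")   # 320 vs. 41, ~8x
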
“…Since the pre-trained audio transformer models are used in a variety of domains, we tested them on: the LibriSpeech dataset, which comprises read speech by native English speakers (Panayotov et al, 2015); spontaneous native English speech, using the Mozilla Common Voice dataset (Ardila et al, 2019); and L2 English speech (speakers with English as their second language), using L2-Arctic (Zhao et al, 2018).…”
Section: Problem Definition and Data
confidence: 99%
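
A sketch of how such a three-way evaluation might be organized. The directory paths and the commented transcribe/score steps are placeholders, not the cited authors' code or any real API.

from pathlib import Path

# Hypothetical local dataset roots; adjust to your own layout.
TEST_SETS = {
    "LibriSpeech (read, native)":         Path("data/librispeech/test-clean"),
    "Common Voice (spontaneous, native)": Path("data/common_voice/en"),
    "L2-ARCTIC (L2 English)":             Path("data/l2arctic"),
}

for name, root in TEST_SETS.items():
    # File extensions vary by corpus (LibriSpeech ships FLAC).
    audio = [p for ext in ("*.wav", "*.flac") for p in root.rglob(ext)]
    print(f"{name}: {len(audio)} utterances")
    # for clip in audio:
    #     hyp = transcribe(clip)            # placeholder ASR call
    #     score(reference_for(clip), hyp)   # placeholder WER scoring
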
“…The AM had five hidden layers and a final softmax layer that produced 5816-dimensional PPGs. We trained the PPG-to-Mel and WaveGlow models on two L2 speakers from the publicly-available L2-ARCTIC corpus [7]: ABA (male Arabic speaker) and EBVS (male Spanish speaker), and two L1 speakers from the ARCTIC corpus [8]: BDL (male American English) and RMS (male American English). Each speaker in L2-ARCTIC and ARCTIC recorded the same set of 1,132 sentences, or about an hour of speech.…”
Section: Speech Corpus
confidence: 99%
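
The quoted numbers imply roughly 3.2 seconds per utterance, a useful sanity check when preparing the data. The sketch below just encodes the speaker inventory and that arithmetic; all figures come from the quote, nothing else is assumed.

# Speaker inventory as described in the quote.
L2_SPEAKERS = {"ABA": "Arabic (male)",
               "EBVS": "Spanish (male)"}           # L2-ARCTIC [7]
L1_SPEAKERS = {"BDL": "American English (male)",
               "RMS": "American English (male)"}   # ARCTIC [8]

SENTENCES = 1132        # same prompt set for every speaker
TOTAL_SECONDS = 3600    # "about an hour of speech" per speaker

print(f"~{TOTAL_SECONDS / SENTENCES:.1f} s per utterance")  # ~3.2 s
for spk, desc in {**L2_SPEAKERS, **L1_SPEAKERS}.items():
    print(spk, "-", desc)
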
“…We performed accent-conversion experiments on pairs of L2 and L1 speakers from the L2-ARCTIC [7] and ARCTIC corpora [8], respectively. For each speaker pair, we generated accent conversions in both directions: L2 speaker with L1 accent, and L1 speaker with L2 accent.…”
Section: Introduction
confidence: 99%
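
A small sketch of enumerating both conversion directions per speaker pair, as the quote describes. Whether every L2 speaker was paired with every L1 speaker is an assumption; the full cross product is shown only for illustration.

from itertools import product

l2_speakers = ["ABA", "EBVS"]   # L2-ARCTIC [7]
l1_speakers = ["BDL", "RMS"]    # ARCTIC [8]

# Cross product shown for illustration; the actual pairing used in the
# cited experiments may have been a subset.
for l2, l1 in product(l2_speakers, l1_speakers):
    print(f"{l2} voice -> {l1} accent (L2 speaker, native accent)")
    print(f"{l1} voice -> {l2} accent (L1 speaker, non-native accent)")
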