In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, each L1 containing recordings from one male and one female speaker. Each speaker recorded approximately one hour of read speech from the Carnegie Mellon University ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions, making it a valuable resource not only for research in voice conversion and accent conversion but also in computer-assisted pronunciation training. The corpus is publicly accessible at https://psi.engr.tamu.edu/l2-arctic-corpus/.
Methods for foreign accent conversion (FAC) aim to generate speech that sounds similar to a given non-native speaker but with the accent of a native speaker. Conventional FAC methods borrow excitation information (F0 and aperiodicity; produced by a conventional vocoder) from a reference (i.e., native) utterance during synthesis time. As such, the generated speech retains some aspects of the voice quality of the native speaker. We present a framework for FAC that eliminates the need for conventional vocoders (e.g., STRAIGHT, World) and therefore the need to use the native speaker's excitation. Our approach uses an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform using a high-quality neural vocoder. At runtime, we drive the synthesizer with the PPG extracted from a native reference utterance. Listening tests show that the proposed system produces speech that sounds more clear, natural, and similar to the non-native speaker compared with a baseline system, while significantly reducing the perceived foreign accent of nonnative utterances.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.