The new AT&T Text-To-Speech (TTS) system for general U.S. English text is based on best-choice components of the AT&T Flextalk TTS, the Festival System from the University of Edinburgh, and ATR's CHATR system. From Flextalk, it employs text normalization, letter-to-sound, and prosody generation. Festival provides a flexible and modular architecture for easy experimentation and competitive evaluation of different algorithms or modules. In addition, we adopted CHATR's unit selection algorithms and modified them in an attempt to guarantee high intelligibility under all circumstances. Finally, we have added our own Harmonic plus Noise Model (HNM) backend for synthesizing the output speech. Most decisions made during the research and development phase of this system were based on formal subjective evaluations. We feel that the new system goes a long way toward delivering on the long-standing promise of truly natural-sounding, as well as highly intelligible, synthesis.
This paper describes Apple's hybrid unit selection speech synthesis system, which provides the voices for Siri and is required to deliver naturalness, personality, and expressivity. It has been deployed to hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, and Mac) via iOS and macOS in multiple languages. The system follows the classical unit selection framework while using deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks are used to predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe various techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.
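The hybrid scoring idea in the abstract above, using network-predicted distributions as unit selection costs, can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not Apple's implementation: a single diagonal Gaussian stands in for the mixture density network's predicted distribution, and all function names and numbers are hypothetical.

```python
import numpy as np

def diag_gaussian_nll(x, mean, var):
    """Negative log-likelihood of feature vector x under a diagonal Gaussian.

    Stand-in for the predicted reference distribution: the abstract describes
    mixture density networks, so a real system would sum over mixture
    components; a single diagonal Gaussian keeps the sketch short.
    """
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

if __name__ == "__main__":
    # Hypothetical numbers: a candidate unit's acoustic features scored
    # against the distribution predicted for the target slot (target cost),
    # and joined boundary frames scored against the predicted join
    # distribution (concatenation cost).
    target_cost = diag_gaussian_nll([1.0, 0.2], mean=[0.9, 0.0], var=[0.1, 0.05])
    concat_cost = diag_gaussian_nll([0.4, 0.5], mean=[0.5, 0.5], var=[0.2, 0.2])
    print(target_cost, concat_cost)
```

Units whose features are likely under the predicted distributions receive low costs and are therefore favored during the search.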
There has been interest for many years in diphone-based speech synthesis and, recently, a rapidly increasing interest in unit selection-based synthesis (as illustrated by interest in the CHATR system). The limits of both approaches are well known. While intelligibility is generally very high for diphone-based systems, the resulting signals do not sound completely natural. This happens for several reasons, among them the limited number of phone variants present in a typical system and the cost of concatenating at diphone boundaries. For unit selection synthesis, typically phone-based, it is possible to produce sentences from a large database that sound surprisingly natural and intelligible. However, quality is often not consistent, and the main difficulties appear to be related to selecting, from a large database, acoustically appropriate units with the correct prosodic characteristics. Typically, no prosody modification is done. In an effort to capture the best features of both approaches, a unit selection and synthesis algorithm has been devised that allows finer control than the CHATR system (version 0.8), both by applying selective prosody modification and by exercising finer control over the units chosen for synthesis. Results of experiments based on this version of unit selection synthesis will be presented.
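The unit selection framework referred to above combines a target cost (how well a candidate unit matches the desired specification) with a concatenation cost (how smoothly adjacent units join) and searches for the sequence of database units with the lowest total cost. The sketch below shows that dynamic-programming search in outline; the data layout and function names are illustrative assumptions, not the exact algorithm of the cited work.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Dynamic-programming (Viterbi-style) search for the unit sequence that
    minimizes summed target and concatenation costs.

    targets           : one target specification per slot in the utterance
    candidates[i]     : list of database units that could fill slot i
    target_cost(t, u) : cost of using unit u for target t
    concat_cost(u, v) : cost of joining unit u to unit v
    """
    # best[i][j] = (cumulative cost, index of best predecessor) for candidates[i][j]
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path from the final slot.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```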