The new AT&T Text-To-Speech (TTS) system for general U.S. English text is based on best-choice components of the AT&T Flextalk TTS, the Festival System from the University of Edinburgh, and ATR's CHATR system. From Flextalk, it employs text normalization, letter-to-sound, and prosody generation. Festival provides a flexible and modular architecture for easy experimentation and competitive evaluation of different algorithms or modules. In addition, we adopted CHATR's unit selection algorithms and modified them in an attempt to guarantee high intelligibility under all circumstances. Finally, we have added our own Harmonic plus Noise Model (HNM) backend for synthesizing the output speech. Most decisions made during the research and development phase of this system were based on formal subjective evaluations. We feel that the new system goes a long way toward delivering on the long-standing promise of truly natural-sounding, as well as highly intelligible, synthesis.
This paper describes Apple's hybrid unit selection speech synthesis system, which provides the voices for Siri and is required to deliver naturalness, personality, and expressivity. It has been deployed to hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, and Mac) via iOS and macOS in multiple languages. The system follows the classical unit selection framework while using deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks are used to predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe various techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.
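The hybrid scoring idea in the abstract above, using network-predicted distributions as unit selection costs, can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not Apple's implementation: a single diagonal Gaussian stands in for the mixture density network's predicted distribution, and all function names and numbers are hypothetical.

```python
import numpy as np

def diag_gaussian_nll(x, mean, var):
    """Negative log-likelihood of feature vector x under a diagonal Gaussian.

    Stand-in for the predicted reference distribution: the abstract describes
    mixture density networks, so a real system would sum over mixture
    components; a single diagonal Gaussian keeps the sketch short.
    """
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

if __name__ == "__main__":
    # Hypothetical numbers: a candidate unit's acoustic features scored
    # against the distribution predicted for the target slot (target cost),
    # and joined boundary frames scored against the predicted join
    # distribution (concatenation cost).
    target_cost = diag_gaussian_nll([1.0, 0.2], mean=[0.9, 0.0], var=[0.1, 0.05])
    concat_cost = diag_gaussian_nll([0.4, 0.5], mean=[0.5, 0.5], var=[0.2, 0.2])
    print(target_cost, concat_cost)
```

Units whose features are likely under the predicted distributions receive low costs and are therefore favored during the search.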
There has been interest for many years in diphone-based speech synthesis and, recently, a rapidly increasing interest in unit selection-based synthesis (as illustrated by interest in the CHATR system). The limits of both approaches are well known. While intelligibility is generally very high for diphone-based systems, the resulting signals do not sound completely natural. This happens for several reasons, among them the limited number of phone variants present in a typical system and the cost of concatenating at diphone boundaries. For unit selection synthesis, typically phone-based, it is possible to produce sentences from a large database that sound surprisingly natural and intelligible. However, quality is often not consistent, and the main difficulties appear to be related to selecting, from a large database, acoustically appropriate units with the correct prosodic characteristics. Typically, no prosody modification is done. In an effort to capture the best features of both approaches, a unit selection and synthesis algorithm has been devised that allows finer control than the CHATR system (version 0.8), both by applying selective prosody modification and by exercising finer control over the units chosen for synthesis. Results of experiments based on this version of unit selection synthesis will be presented.
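The unit selection framework referred to above combines a target cost (how well a candidate unit matches the desired specification) with a concatenation cost (how smoothly adjacent units join) and searches for the sequence of database units with the lowest total cost. The sketch below shows that dynamic-programming search in outline; the data layout and function names are illustrative assumptions, not the exact algorithm of the cited work.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Dynamic-programming (Viterbi-style) search for the unit sequence that
    minimizes summed target and concatenation costs.

    targets           : one target specification per slot in the utterance
    candidates[i]     : list of database units that could fill slot i
    target_cost(t, u) : cost of using unit u for target t
    concat_cost(u, v) : cost of joining unit u to unit v
    """
    # best[i][j] = (cumulative cost, index of best predecessor) for candidates[i][j]
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path from the final slot.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```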