Voice selection for speech synthesis

Syrdal, Ann K.; Conkie, Alistair; Stylianou, Yannis; Schroeter, Juergen; Garrison, Laurie F.; Dutton, Dawn L.

doi:10.1121/1.420883

Cited by 2 publications

(1 citation statement)

References 0 publications


“…Formal listening tests were conducted throughout the research and development phase of Next-Gen TTS. We believe that selecting the voice for rendering the many hours of inventory speech was the most critical decision [7]. We also have identified acoustic correlates of listener ratings relevant to speaker selection [8].…”

Section: Discussion (mentioning)

confidence: 99%

The AT&T Next-Gen TTS System

Beutnagel, Conkie, Schroeter, et al. 1999

The Journal of the Acoustical Society of America

Self Cite

138


The new AT&T Text-To-Speech (TTS) system for general U.S. English text is based on best-choice components of the AT&T Flextalk TTS, the Festival System from the University of Edinburgh, and ATR's CHATR system. From Flextalk, it employs text normalization, letter-to-sound, and prosody generation. Festival provides a flexible and modular architecture for easy experimentation and competitive evaluation of different algorithms or modules. In addition, we adopted CHATR's unit selection algorithms and modified them in an attempt to guarantee high intelligibility under all circumstances. Finally, we have added our own Harmonic plus Noise Model (HNM) backend for synthesizing the output speech. Most decisions made during the research and development phase of this system were based on formal subjective evaluations. We feel that the new system goes a long way toward delivering on the long-standing promise of truly natural-sounding, as well as highly intelligible, synthesis.

