Investigating a neural all pass warp in modern TTS applications

Schnell, Bastian; Garner, Philip N.

doi:10.1016/j.specom.2021.12.002

Cited by 5 publications

(5 citation statements)

References 17 publications

“…These variables make it challenging for the TTS system to adapt to every individual's voice [9]. (3) Tendency to Overfit - Overtraining a TTS system based on training data from specific speaker(s) can result in overfitting, where the TTS system becomes too specialized to the training data and performs poorly on new data, i.e., a new speaker's voice or new or rare vocabulary [10]. (4) Challenges in Fine-tuning - Being able to fine-tune a TTS system to a specific speaker's voice can be challenging because it requires adjusting the parameters of an initial model.…”

Section: Adapting To a Specific Speaker's Voice (mentioning)

confidence: 99%

Text-To-Speech in Voice Assistants: Challenges and Mitigation Strategies

Vishnu Kadam

2023

J Eng App Sci Technol

Voice Assistants (VAs) have grown rapidly from technological novelties to integral parts of our daily lives to perform tasks like streaming music or news, setting alarms or responding to questions. These virtual conversational agents rely on an intricate combination of technologies, and one of the pivotal components is Text-to-Speech (TTS) synthesis. In this paper, we delve into the technical intricacies of TTS in voice assistants, addressing challenges, solutions, and future directions. VAs like Alexa, Siri and Google Assistant have transformed human-computer interactions. The underpinning TTS technology is crucial for converting text-based information into spoken language, making the interaction more natural and accessible. The synthesis of human-like speech from textual data is a complex and interdisciplinary domain, encompassing fields such as speech signal processing, natural language processing, deep learning, and linguistics. This paper aims to contribute a detailed analysis of TTS in voice assistants, emphasizing not only the theoretical aspects but also the practical implementation and real-world implications. The paper will examine the challenges associated with TTS, considering its technical, linguistic, and user-centric dimensions. The paper will also present mitigation strategies for these challenges. In a world where voice-driven interactions are becoming commonplace, a deep understanding of TTS is vital. By delving into the depths of this technology, we can unlock its full potential and ensure that voice assistants continue to enrich our lives and technical domains.

“…Zero-shot attempts to customize target speech from an unseen target speaker's speech by extracting a speaker embedding from the original target speaker's dataset without using parameters. Investigations indicate improved speaker similarity and demonstrate that the neural all-pass warp (APW) using Tacotron2 (encoder-decoder architecture) raises the generalizability of a multi-speaker model with a zero-shot speaker adaptation [53]. However, the zero-shot method usually suffers from inadequate speaker similarity.…”

Section: Speaker Adaptation (mentioning)

confidence: 99%

A Smart Control System for the Oil Industry Using Text-to-Speech Synthesis Based on IIoT

et al. 2023

Oil refineries have high operating expenses and are often exposed to increased asset integrity risks and functional failure. Real-time monitoring of their operations has always been critical to ensuring safety and efficiency. We proposed a novel Industrial Internet of Things (IIoT) design that employs a text-to-speech synthesizer (TTS) based on neural networks to build an intelligent extension control system. We enhanced a TTS model to achieve high inference speed by employing HiFi-GAN V3 vocoder in the acoustic model FastSpeech 2. We experimented with our system on a low resources-embedded system in a real-time environment. Moreover, we customized the TTS model to generate two target speakers (female and male) using a small dataset. We performed an ablation analysis by conducting experiments to evaluate the performance of our design (IoT connectivity, memory usage, inference speed, and output speech quality). The results demonstrated that our system Real-Time Factor (RTF) is 6.4 (without deploying the cache mechanism, which is a technique to call the previously synthesized speech sentences in our system memory). Using the cache mechanism, our proposed model successfully runs on a low-resource computational device with real-time speed (RTF equals 0.16, 0.19, and 0.29 when the memory has 250, 500, and 1000 WAV files, respectively). Additionally, applying the cache mechanism has reduced memory usage percentage from 16.3% (for synthesizing a sentence of ten seconds) to 6.3%. Furthermore, according to the objective speech quality evaluation, our TTS model is superior to the baseline TTS model.

“…Kumar N et al presented a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) [25]. Compared to the normalization architecture, ZSM-SS added non-autoregressive multi-head attention between the encoder-decoder architecture [26][27][28].…”

Section: TTS (mentioning)

confidence: 99%

An Improved Chinese Pause Fillers Prediction Module Based on RoBERTa

Yu, Zhou, Niu

2023

Applied Sciences

The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. In recent years, neural networks, including LSTM, BERT, and XLNet, have been employed for pause fillers prediction modules. However, these methods have exhibited relatively lower accuracy in predicting pause fillers. This paper introduces the utilization of the RoBERTa model for predicting Chinese pause fillers and presents a novel approach to training the RoBERTa model, effectively enhancing the accuracy of Chinese pause fillers prediction. Our proposed approach involves categorizing text from different speakers into four distinct style groups based on the frequency and position of Chinese pause fillers. The RoBERTa model is trained on these four groups of data, which incorporate different styles of fillers, thereby ensuring a more natural synthesis of speech. The Chinese pause fillers prediction module is evaluated on systems such as Parallel Tacotron2, FastPitch, and Deep Voice3, achieving a notable 26.7% improvement in word-level prediction accuracy compared to the BERT model, along with a 14% enhancement in position-level prediction accuracy. This substantial improvement results in a significant enhancement of the naturalness of the generated speech.
