Current trends in multilingual speech processing

Bourlard, Hervé; Dines, John; Magimai-Doss, Mathew; Garner, Philip N.; Motlíček, Petr; Liang, Hui; Saheer, Lakshmi Babu; Valente, Fabio

doi:10.1007/s12046-011-0050-4

Cited by 28 publications

(16 citation statements)

References 86 publications


“…In another direction, modelling speakers with a speaker discriminative Deep Neural Network (DNN) has shown good performance for SV [7,8]. Motivated by the success of DNNs in the context of speaker, speech [9,10] and image recognition tasks, we explore the application of DNNs for the Random-digit task. We believe that the DNN based speaker embedding features can be useful for representing the invariant speaker characteristics.…”

Section: Introduction (mentioning)

confidence: 99%

DNN Based Speaker Embedding Using Content Information for Text-Dependent Speaker Verification

Dey, Koshinaka, Motlíček, et al. 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


In this paper, we explore Deep Neural Network (DNN) based speaker embedding for the Random-digit task using content information. To this end, a technique is applied to automatically select common phonetic units between the enrollment and test data to produce speaker verification scores. Furthermore, a novel approach is proposed to incorporate content information in the DNN directly. It is hypothesized that features extracted using this DNN will be helpful for the task. Experiments on the RSR dataset show that the proposed method outperforms the baseline i-vector system by 43% relative in equal error rate.


“…Expanding the coverage of the world's languages in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems has been attracting much interest in both academia and industry [1,2]. Conventional phonetically-based speech processing systems require pronunciation dictionaries that map phonetic units to words.…”

Section: Introduction (mentioning)

confidence: 99%

Bytes Are All You Need: End-to-end Multilingual Speech Recognition and Synthesis with Bytes

Zhang, Sainath, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

121


We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing. In this work, we model text via a sequence of Unicode bytes, specifically, the UTF-8 variable length byte sequence for each character. Bytes allow us to avoid large softmaxes in languages with large vocabularies, and share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in monolingual end-to-end speech recognition. Additionally, our multilingual byte model outperforms each respective single-language baseline by 4.4% relative on average. In Japanese-English code-switching speech, our multilingual byte model outperforms our monolingual baseline by 38.6% relative. Finally, we present an end-to-end multilingual speech synthesis model using byte representations which matches the performance of our monolingual baselines.
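The byte-level text representation described in this abstract can be illustrated with a minimal Python sketch. It only shows how UTF-8 turns each character into a variable-length sequence of byte values drawn from a fixed range of 256 symbols; it is not the authors' implementation, and the function names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical names chosen for this illustration.

```python
# Illustrative sketch only: maps text to UTF-8 byte IDs and back.
# The function names are hypothetical, not taken from the cited paper.

def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte values back to text; invalid sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in ["a", "é", "日"]:
        ids = text_to_byte_ids(sample)
        # ASCII uses 1 byte per character; other characters use 2 to 4.
        print(sample, ids, len(ids))
```

Whatever the language, every output symbol is one of at most 256 byte values, which is the property the abstract relies on to avoid large softmax layers. Decoding with `errors="replace"` is one possible way to handle byte sequences that are not valid UTF-8, which a model predicting bytes one at a time could produce.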


“…In recent years, cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2]. Since cross-lingual speech synthesis can synthesize speech in different languages with the same or a different speaker's voice, it has been widely used in human-computer interaction,…”

Section: Introduction (mentioning)

confidence: 99%

Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Zhang, Yang, et al. 2019

IEEE Access


This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis method that realizes both Mandarin and Tibetan speech synthesis within a single framework. Because a Tibetan training corpus is hard to record, we train the acoustic models with a large-scale Mandarin multi-speaker corpus and a small-scale Tibetan one-speaker corpus. The acoustic models are trained with a deep neural network (DNN), hybrid long short-term memory (LSTM), and hybrid bi-directional long short-term memory (BLSTM). We further extend our Chinese text analyzer by adding a Tibetan text analyzer for generating context-dependent labels from input Chinese or Tibetan sentences. The Tibetan text analyzer includes text normalization, a novel Tibetan word segmentation method that combines a BLSTM with a conditional random field, prosodic boundary prediction, and grapheme-to-phoneme conversion. We select the initials and the finals of both Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed-language average voice model (AVM) with DNN, hybrid LSTM, and hybrid BLSTM from a mixed Mandarin and Tibetan corpus. Speaker adaptation is then applied to train speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan from the AVM with a small target-speaker corpus. Finally, we synthesize Mandarin or Tibetan speech through the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework is better than the other two cross-lingual frameworks and the Tibetan monolingual framework. The mixed Tibetan training corpus does not influence the voice quality of synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual speech synthesis framework needs only 60% of the training corpus to synthesize a voice similar to that of the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low-resource languages by borrowing the corpus of a high-resource language.

Index terms: Mandarin-Tibetan cross-lingual speech synthesis, Tibetan speech synthesis, minority language speech synthesis, deep learning, low resource languages.

