ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413669
The ThinkIT System for ICASSP 2021 M2VoC Challenge

Abstract: In this paper, we introduce the low-resource text-to-speech system submitted by the ThinkIT team to the Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC). The challenge has two tasks: the few-shot Track 1 provides 100 samples per target speaker, while the one-shot Track 2 offers only 5. Each track contains two sub-tracks, A and B; unlike sub-track A, sub-track B may use extra public data in addition to the released data. We participate in sub-track A only, and choose fine-tuning as our backbone strategy. Our …
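The abstract names fine-tuning as the team's backbone strategy for few-shot voice cloning: adapt a pretrained model on a handful of target-speaker samples. A minimal sketch of that idea, assuming a toy two-layer linear model where only the head is updated (all names and shapes here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def finetune_last_layer(W_frozen, W_head, X, y, lr=0.1, steps=200):
    """Few-shot adaptation sketch: keep the pretrained backbone W_frozen
    fixed and run gradient descent only on the head W_head, using the
    target speaker's small sample set (X, y)."""
    for _ in range(steps):
        H = np.tanh(X @ W_frozen)          # frozen backbone features
        pred = H @ W_head                  # trainable head
        grad = H.T @ (pred - y) / len(X)   # MSE gradient w.r.t. head only
        W_head = W_head - lr * grad
    return W_head
```

Freezing the backbone keeps the speaker-independent knowledge intact while the few released samples reshape only the output mapping, which is why fine-tuning is a natural fit for a 5- to 100-sample regime.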

Cited by 4 publications (1 citation statement)
References 10 publications
“…T18 proposed to use a BERT [41] module to predict the break of each Chinese character in an input sentence. T15 [42] used a fine-grained encoder added at the decoder's tail, which extracts variable-length detailed style information from multiple reference samples via an attention mechanism. T03 and T15 also used global style tokens (GST) for both speaker and style control, which consists of a reference encoder, style attention, and style embedding.…”
Section: Speaker and Style Modeling
confidence: 99%
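The citation statement describes GST as three parts: a reference encoder, style attention, and a style embedding. A minimal numpy sketch of that token-attention step, assuming a mean-pool stand-in for the reference encoder (all dimensions and names are illustrative, not taken from the challenge systems):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GlobalStyleTokens:
    """Minimal GST layer: a reference-encoder summary attends over a
    bank of learnable style tokens to produce one style embedding."""

    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.tokens = rng.standard_normal((num_tokens, token_dim)) * 0.1
        self.query_proj = rng.standard_normal((ref_dim, token_dim)) * 0.1

    def reference_encode(self, mel):
        # stand-in reference encoder: mean-pool the mel frames
        return mel.mean(axis=0)                       # (ref_dim,)

    def __call__(self, mel):
        q = self.reference_encode(mel) @ self.query_proj        # (token_dim,)
        scores = softmax(self.tokens @ q / np.sqrt(self.tokens.shape[1]))
        return scores @ self.tokens                   # style embedding
```

Because the style embedding is a convex combination of a fixed token bank, the same bank can serve both speaker and style control, which matches how the quoted systems reuse GST for both.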