2023
DOI: 10.1609/aaai.v37i11.26479

Avocodo: Generative Adversarial Network for Artifact-Free Vocoder

Abstract: Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experim…

Cited by 8 publications (1 citation statement)
References 28 publications
“…Mehta et al. [29], for example, propose the use of Neural Hidden Markov Models (HMMs) with normalising flows as an acoustic model. In the context of generating speech signal from mel spectrograms, vocoders based on Generative Adversarial Networks (GANs) [8, 30, 31, 32] have gained popularity due to their efficient inference speed, lightweight networks, and ability to produce high-quality waveforms. Furthermore, end-to-end models like VITS [33] or YourTTS [34] have been developed, enabling direct generation of audio signals from linguistic input without the need of an additional vocoder model.…”
Section: Related Work (mentioning)
confidence: 99%