MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

Tae, Jaesung; Kim, Hyeongju; Lee, Younggun

doi:10.1109/mlsp52302.2021.9596184

Cited by 8 publications

(3 citation statements)

References 8 publications

“…Multi-layer perceptron (MLP) is a neural network with forward structure. It has a simple structure and strong adaptive ability, and it is widely used in the fields of natural language processing [68] and computer vision [69]. Although CNN-based and transformer-based networks are the mainstream choices in the field of computer vision, researchers still try to build the network architecture completely using MLP to explore more possibilities for visual network architecture.…”

Section: MLP-based Architectures (mentioning)

confidence: 99%

A Survey on Image Semantic Segmentation Using Deep Learning Techniques

Cheng, Li, Li et al., 2023

Computers, Materials & Continua

Image semantic segmentation is an important branch of computer vision with a wide variety of practical applications, such as medical image analysis, autonomous driving, and virtual or augmented reality. In recent years, owing to the remarkable performance of transformers and multilayer perceptrons (MLPs) in computer vision, which is comparable to that of convolutional neural networks (CNNs), a substantial number of image semantic segmentation works have aimed at developing different types of deep learning architectures. This survey aims to provide a comprehensive overview of deep learning methods in the field of general image semantic segmentation. First, the commonly used image segmentation datasets are listed. Next, pioneering works are studied in depth from multiple perspectives (e.g., network structures, feature fusion methods, attention mechanisms) and divided into four categories according to network architecture: CNN-based architectures, transformer-based architectures, MLP-based architectures, and others. Furthermore, this paper presents some common evaluation metrics and compares the respective advantages and limitations of popular techniques, both in terms of architectural design and their experimental results on the most widely used datasets. Finally, possible future research directions and challenges are discussed for the reference of other researchers.

“…To further distinguish vowels and consonants, a duration predictor is built to produce fine-grained phoneme-level duration, which is trained based on supervision calculated by force-alignment [6][7][8][9][10][11], heuristics [12][13][14][15] etc. The advantage of this type of feature processing strategy is that the input phoneme and pitch sequence are strictly aligned at the note level based on the music score.…”

Section: Introduction (mentioning)

confidence: 99%

PHONEix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor

Wu, Shi, Tao et al., 2023

Preprint

Singing voice synthesis (SVS), the task of generating a singing voice from a music score, has drawn much attention in recent years. SVS faces the challenge that singing allows flexible pronunciation for the same music score. Most previous SVS work does not handle the misalignment between the music score and actual singing well. In this paper, we propose an acoustic feature processing strategy, named PHONEix, with a phoneme distribution predictor, to narrow the gap between the music score and the singing voice; it can be easily adopted in different SVS systems. Extensive experiments in various settings demonstrate the effectiveness of PHONEix in both objective and subjective evaluations.

“…While the convolutional layers can process variable length sequences and capture short-term correlations in speech, long-term contextual information may not easily be handled by convolutional layers compared with MLPs. In [16] and [17], MLP-based models were applied to speech or audio signals of fixed maximum length. A keyword spotting method based on a structure similar to the MLP-mixer employing the dynamic convolution [18] and the squeeze-and-excitation network (SENet) [19] is proposed in [20].…”

Section: Introduction (mentioning)

confidence: 99%

Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network

2022

The Conformer has shown impressive performance for speech enhancement by exploiting local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-mixer and gMLP have demonstrated comparable performance with much less computational complexity in the computer vision area. These models showed that all-MLP architectures may perform as well as more advanced structures, but the nature of the MLP limits the application of these architectures to variable-length inputs such as speech and audio. In this paper, we propose the cgMLP-SE model, which is a gMLP-based architecture with convolutional token mixing modules and a squeeze-and-excitation network (SENet) to utilize both local and global contextual information as in the Conformer. Specifically, the token-mixing modules in gMLP are replaced by convolutional layers, SENet-based gating is applied on top of the convolutional gating module, and additional feed-forward layers are added to make the cgMLP-SE module a macaron-like structure sandwiched by feed-forward layers like a Conformer block. Experimental results on the TIMIT-DNS noise dataset and the Voice Bank-DEMAND dataset showed that the proposed method exhibited similar speech quality and intelligibility to the Conformer with a smaller model size and less computational complexity.
