2022
DOI: 10.1109/jstsp.2022.3181782
Efficient Personalized Speech Enhancement Through Self-Supervised Learning

Cited by 11 publications (4 citation statements)
References 47 publications
“…Third, when we compare the two model sizes, more significant performance improvement is observed when smaller models are in comparison (PLPCNet-S vs. LPCNet-BL-S) than the larger models (PLPCNet-L vs. LPCNet-BL-L). This trend aligns well with the personalized speech enhancement literature: model personalization benefits compressed model architectures more than the larger ones [15,16,17]. Finally, it is also worth noting that each test sequence is handled by a selected personalized decoder, where the choice is based on the estimated speaker class.…”
Section: Results (supporting)
Confidence: 76%
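
The decoder-selection mechanism mentioned in the excerpt above (each test sequence routed to a personalized decoder chosen from an estimated speaker class) can be pictured with a minimal sketch. Everything below is illustrative: the GRU encoder, module sizes, and hard argmax class selection are assumptions, not the cited paper's actual architecture.

```python
import torch
import torch.nn as nn

class ClassConditionedEnhancer(nn.Module):
    """Route each utterance to one of K personalized decoders based on an
    estimated speaker class. All sizes and modules are illustrative."""

    def __init__(self, num_classes: int = 4, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=feat_dim, batch_first=True)
        # Lightweight speaker-class head on the time-pooled encoder state.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # One small "personalized" decoder per speaker class (placeholders).
        self.decoders = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_classes))

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, time, 1) noisy waveform frames
        features, _ = self.encoder(noisy)                 # (batch, time, feat_dim)
        class_logits = self.classifier(features.mean(1))  # (batch, num_classes)
        class_ids = class_logits.argmax(dim=-1)           # hard speaker-class estimate
        # Each utterance is handled by the decoder selected for its class.
        enhanced = torch.stack(
            [self.decoders[int(c)](f) for c, f in zip(class_ids, features)]
        )
        return enhanced                                    # (batch, time, 1)

# Toy usage on random input standing in for two noisy utterances.
model = ClassConditionedEnhancer()
out = model(torch.randn(2, 1600, 1))
```

The hard selection here only mirrors the "selected personalized decoder" phrasing in the excerpt; a soft mixture over decoders would be an equally plausible reading.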
“…Personalization has shown promising results in model compression tasks for speech enhancement [15,16,17,18]. A personalized model adapts to the target speaker group's speech traits, narrowing the training task down to a smaller subtask, i.e., one defined by a smaller speaker group than the entire set of speakers in the corpus.…”
Section: Introduction (mentioning)
Confidence: 99%
“…Moreover, Tao et al [42] presented a method called Neighbor2Neighbor to train an effective image-denoising model with only noisy images. Aswin et al [43] proposed self-supervised learning methods as a solution to both zero- and few-shot personalization tasks. Sonining et al [37] investigated the performance of such a time-domain network (Conv-TasNet) for speech denoising in a real-time setting, comparing various parameter settings.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Self-supervised learning can be used to train on individual data to build patient-specific models. For data that is already split at subject level, we can apply self-supervised learning directly [459]. When the data is not split, we can apply clustering to find subgroups in the data to apply self-supervised learning on [460,461].…”
Section: Personalized Models (mentioning)
Confidence: 99%
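
The last excerpt describes clustering unlabeled data into subgroups before applying self-supervised learning per group. A minimal sketch of that grouping step is given below, assuming k-means over precomputed speaker (or subject) embeddings; the embedding source and the number of clusters are placeholders, not the setups of the cited works [460,461].

```python
import numpy as np
from sklearn.cluster import KMeans

def group_utterances(embeddings: np.ndarray, num_groups: int = 4) -> np.ndarray:
    """Assign each utterance (or subject) to a pseudo-group via k-means.

    embeddings: (num_items, embed_dim) speaker/subject embeddings from any
    pretrained encoder; the encoder choice is an assumption of this sketch.
    Each resulting group would then get its own self-supervised or
    personalized training run.
    """
    kmeans = KMeans(n_clusters=num_groups, n_init=10, random_state=0)
    return kmeans.fit_predict(embeddings)

# Toy usage with random vectors standing in for real embeddings.
group_ids = group_utterances(np.random.randn(100, 192))
```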