Speech Enhancement with Weakly Labelled Data from AudioSet

Kong, Qiuqiang; Liu, Haohe; Du, Xingjian; Chen, Li; Xia, Rui; Wang, Yu-Xuan

doi:10.21437/interspeech.2021-259

Cited by 5 publications

(8 citation statements)

References 0 publications

“…3) AudioCaps: Our downloaded test set of the AudioCaps dataset [33] includes 957 audio clips, each annotated with five captions. To generate audio mixtures, we initially select an audio clip from the test set to serve as the target source, followed by a random selection of another audio clip as the background source, ensuring that the sound event tag of the background source does not coincide with that of the target source. For the test mixtures, each test audio is mixed with five randomly chosen background sources with an SNR at 0 dB.…”

Section: Datasets and Evaluation Benchmark (mentioning)

confidence: 99%
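
The mixing procedure quoted above can be sketched in a few lines. This is a minimal illustration rather than the citing authors' code: it assumes SNR is defined as the ratio of mean signal powers, that clips are equal-length lists of float samples, and that each clip carries a single tag. The names `power`, `mix_at_snr`, and `pick_backgrounds` are hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(target, background, snr_db=0.0):
    """Scale `background` so that the target-to-background power ratio
    equals `snr_db`, then add it to `target`.

    Returns (mixture, scaled_background). Inputs must have equal length.
    """
    if len(target) != len(background):
        raise ValueError("target and background must have the same length")
    p_target = power(target)
    p_background = power(background)
    if p_background == 0.0:
        raise ValueError("background is silent; SNR is undefined")
    # Required background power is p_target / 10^(snr_db / 10).
    gain = math.sqrt(p_target / (p_background * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * b for b in background]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


def pick_backgrounds(target_tag, clips, n=5, rng=None):
    """Pick `n` background clips whose tag differs from `target_tag`.

    `clips` is a hypothetical list of (tag, samples) pairs.
    """
    rng = rng or random.Random(0)
    candidates = [clip for clip in clips if clip[0] != target_tag]
    return rng.sample(candidates, n)
```

At 0 dB the gain simply equalises the power of the two sources, so neither dominates the mixture.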

“…The test set of the Voicebank-Demand dataset includes a total of 824 utterances, which is used to evaluate the zero-shot performance of our model on speech enhancement. To make a fair comparison with previous speech enhancement systems [6], [7], [70], [71], we resample all audio clips at 16 kHz. We use "Speech" as the input text query to perform speech enhancement.…”

Section: ESC-50 (mentioning)

confidence: 99%

“…We utilize signal-to-distortion ratio improvement (SDRi) [15], [20] and scale-invariant SDR (SI-SDR) [72] to evaluate the performance of sound separation tasks. For the speech enhancement task, following previous works [6], [7], [70], [71], we apply the perceptual evaluation of speech quality (PESQ) [73], mean opinion score (MOS) predictor of signal distortion (CSIG), MOS predictor of background-noise intrusiveness (CBAK), MOS predictor of overall signal quality (COVL) [74] and segmental signal-to-noise ratio (SSNR) [75] for evaluation. For each evaluation metric, higher values indicate better performance.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%
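
For readers unfamiliar with the two separation metrics named above, the sketch below shows one common way to compute them. It assumes the plain energy-ratio definition of SDR and the usual projection-based definition of SI-SDR. The citing paper may use a different SDR implementation, so treat the function names and the `eps` stabiliser as illustrative assumptions.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sample sequences."""
    return sum(x * y for x, y in zip(a, b))


def sdr(estimate, reference, eps=1e-12):
    """Signal-to-distortion ratio in dB (plain energy-ratio form)."""
    noise = [r - e for r, e in zip(reference, estimate)]
    return 10.0 * math.log10(
        (_dot(reference, reference) + eps) / (_dot(noise, noise) + eps)
    )


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB.

    The estimate is projected onto the reference; whatever is left over
    after the projection counts as distortion.
    """
    alpha = _dot(estimate, reference) / (_dot(reference, reference) + eps)
    projection = [alpha * r for r in reference]
    residual = [e - p for e, p in zip(estimate, projection)]
    return 10.0 * math.log10(
        (_dot(projection, projection) + eps) / (_dot(residual, residual) + eps)
    )


def sdr_improvement(estimate, mixture, reference):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)
```

SDRi is positive when separation brings the output closer to the reference than the unprocessed mixture was, which matches the quoted remark that higher values indicate better performance.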

“…Sound separation is a fundamental research task for CASA, which aims to separate real-world sound recordings into individual source tracks, also known as the "cocktail party problem" [2]. Sound separation has a wide range of applications, including audio event separation [3], [4], music source separation [5], and speech enhancement [6], [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…Many previous works on sound separation mainly focus on separating one or a few sources such as speech enhancement [6], [7], speech separation [8], [9], and music source separation [5]. Recently, universal sound separation (USS) [4] has attracted a lot of research interest.…”

Section: Introduction (mentioning)

confidence: 99%

Separate What You Describe: Language-Queried Audio Source Separation

Liu, Liu, Kong et al. 2022

Interspeech 2022

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

DeepLabV3+ Vision Transformer for Visual Bird Sound Denoising

Wang, Zhang 2023

IEEE Access

Audio denoising is a task to improve the perceptual quality of noisy audio signals. There is still residual noise after the denoising of noisy signals, which will affect the quality of audio data. Traditional and deep learning-based methods are still limited to the manual addition of artificial noise or low-frequency noise. Recently, audio denoising has been transformed into an image segmentation problem, and deep neural networks have been applied to solve this problem. However, its performance is limited to shallow image segmentation models. This paper proposes a novel vision transformer model for visual bird sound denoising, combining a pyramid transformer and DeepLabV3+ network (named PtDeepLab) to filter out the noise. The proposed PtDeepLab model is based on the pyramid transformer, which generates long-range and multiscale representations. The PtDeepLab model can achieve intuitive noise reduction in audio, which helps to separate clean audio from the mixture signal. Extensive experimental results showed that the proposed model has a better denoising performance than state-of-the-art methods.

Category-Adapted Sound Event Enhancement with Weakly Labeled Data

Dinkel et al. 2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
