A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation

Websdale, Danny; Milner, Ben

doi:10.21437/interspeech.2017-1504

Cited by 4 publications

(4 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Bregman in his book attributed auditory segregation to Auditory Scene Analysis (ASA) and summarized the segregation process into two stages: segmentation and grouping [6]. ASA procedure was applied by the individual acoustic system to segregate sound [7]. ASA evaluated and recuperated single and distinct sound from an amalgam of noises to generate expressive speech essentials once the sound fundamentals had been removed.…”

Section: Related Workmentioning

confidence: 99%

Speech Segregation in Background Noise Based on Deep Learning

et al. 2020

View full text Add to dashboard Cite

Section: Related Workmentioning

confidence: 99%

Speech Segregation in Background Noise Based on Deep Learning

et al. 2020

View full text Add to dashboard Cite

“…In the literature, extensive research has been carried out to develop speech separation methods for speech recognition [6,7]. Researchers have proposed several different speech separation models such as parametric mask estimation methods [8,9,10], neural network based mask estimation methods [11,12], and novel loss functions [13]. However, limited work has been conducted to develop robust speaker independent audio-visual speech separation models to perform enhancement.…”

Section: Introductionmentioning

confidence: 99%

“…The few attempts to address this problem have been restricted to speaker dependent scenarios. In [13], audio and visual features are first concatenated into a single vector. The concatenated vector is then used to train a non-causal speakerdependent DNN with a perceptually motivated loss function inspired by the hit minus false-alarm (HIT-FA) rate.…”

Section: Introductionmentioning

confidence: 99%

DNN Driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Gogate¹,

Adeel²,

Marxer³

et al. 2018

Interspeech 2018

View full text Add to dashboard Cite

Human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audiovisual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV features extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverages the complementary strengths of a stacked long short term memory (LSTM) and convolution LSTM network. The comparative simulation results in terms of speech quality and intelligibility demonstrate significant performance improvement of our proposed AV mask estimation model as compared to audio-only and visual-only mask estimation approaches for both speaker dependent and independent scenarios.

show abstract

“…The human auditory system segregates sound using a process known as auditory scene analysis (ASA) [ 10 ]. ASA analyses and recovers single and individual sound from a mixture of sounds to produce meaningful speech elements after removing noise elements.…”

Section: Introductionmentioning

confidence: 99%

A hybrid technique for speech segregation and classification using a sophisticated deep neural network

et al. 2018

View full text Add to dashboard Cite

Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of proposed method for speech segregation. The qualitative and quantitative analysis carried out on three datasets demonstrate the efficiency of the proposed method compared to the state-of-the-art speech segregation and classification-based methods.

show abstract

A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation

Cited by 4 publications

References 29 publications

Speech Segregation in Background Noise Based on Deep Learning

Speech Segregation in Background Noise Based on Deep Learning

DNN Driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

A hybrid technique for speech segregation and classification using a sophisticated deep neural network

Contact Info

Product

Resources

About