Interspeech 2017
DOI: 10.21437/interspeech.2017-1504
A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation

Abstract: This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation. Previous loss functions have focused on maximising the classification accuracy of mask estimation, but we now propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate, which is known to correlate more closely with speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function used in binary mask estimation, which…
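To illustrate the contrast the abstract draws, the sketch below places the binary cross-entropy baseline next to a soft, differentiable surrogate for the HIT-FA criterion. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names, the soft relaxation of the hit and false-alarm rates, and the tensor shapes are illustrative choices of our own.

```python
import torch
import torch.nn.functional as F

def ce_loss(pred, target):
    """Baseline: binary cross-entropy over all time-frequency units.

    pred:   soft mask estimates in [0, 1] (e.g. sigmoid outputs)
    target: ideal binary mask, same shape as pred
    """
    return F.binary_cross_entropy(pred, target)

def hit_fa_loss(pred, target, eps=1e-8):
    """Hypothetical soft surrogate for the HIT-FA criterion.

    Hard classification decisions are replaced by soft mask values so the
    hit and false-alarm rates stay differentiable; eps guards against
    division by zero when a batch contains no target (or no noise) units.
    """
    hit = (pred * target).sum() / (target.sum() + eps)                  # soft hit rate
    fa = (pred * (1.0 - target)).sum() / ((1.0 - target).sum() + eps)   # soft false-alarm rate
    return fa - hit  # minimising this maximises HIT - FA
```

In use, `pred = torch.sigmoid(logits)` from a mask-estimation network and the ideal binary mask would be passed to either loss; minimising `fa - hit` pushes the estimated mask toward high hit rates at low false-alarm rates rather than raw classification accuracy.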

Cited by 4 publications (4 citation statements) | References: 29 publications
“…Bregman, in his book, attributed auditory segregation to Auditory Scene Analysis (ASA) and summarized the segregation process into two stages: segmentation and grouping [6]. The ASA procedure is applied by the human auditory system to segregate sound [7]. ASA analyses and recovers single, distinct sounds from a mixture of noises to produce meaningful speech elements once the noise elements have been removed.…”
Section: Related Work
confidence: 99%
“…In the literature, extensive research has been carried out to develop speech separation methods for speech recognition [6,7]. Researchers have proposed several different speech separation models, such as parametric mask estimation methods [8,9,10], neural network based mask estimation methods [11,12], and novel loss functions [13]. However, limited work has been conducted to develop robust speaker-independent audio-visual speech separation models to perform enhancement.…”
Section: Introduction
confidence: 99%
“…The few attempts to address this problem have been restricted to speaker-dependent scenarios. In [13], audio and visual features are first concatenated into a single vector. The concatenated vector is then used to train a non-causal speaker-dependent DNN with a perceptually motivated loss function inspired by the hit minus false-alarm (HIT-FA) rate.…”
Section: Introduction
confidence: 99%
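The statement above outlines a simple fusion strategy: concatenate audio and visual features into one vector and feed the result to a fully connected DNN that estimates the mask. The sketch below is a hypothetical rendering of that setup, not the architecture from [13]; the class name, layer sizes, and feature dimensions (a 257-bin spectral frame plus a 50-dimensional visual embedding) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Hypothetical audio-visual mask estimator: the two modalities are
    concatenated into a single vector before a fully connected DNN
    predicts a per-frequency soft mask. All dimensions are illustrative."""

    def __init__(self, audio_dim=257, visual_dim=50, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, audio_dim), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, audio_feats, visual_feats):
        # Early fusion: concatenate along the feature axis, then estimate
        # one mask value per frequency bin of the audio frame.
        x = torch.cat([audio_feats, visual_feats], dim=-1)
        return self.net(x)
```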
“…The human auditory system segregates sound using a process known as auditory scene analysis (ASA) [10]. ASA analyses and recovers single, individual sounds from a mixture of sounds to produce meaningful speech elements after removing noise elements.…”
Section: Introduction
confidence: 99%