The 14th International Conference on Auditory-Visual Speech Processing 2017
DOI: 10.21437/avsp.2017-9

Using visual speech information and perceptually motivated loss functions for binary mask estimation

Abstract: This work is concerned with using deep neural networks (DNNs) for estimating binary masks within a speech enhancement framework. We first examine the effect of supplementing the audio features used in mask estimation with visual speech information. Visual speech is known to be robust to noise, although not necessarily as discriminative as audio features, particularly at higher signal-to-noise ratios. Furthermore, most DNN approaches to mask estimation use the cross-entropy (CE) loss function, which aims to maximise class…
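The mask-estimation setup the abstract describes typically trains a DNN against the ideal binary mask (IBM): a time-frequency bin is kept when its local speech-to-noise ratio exceeds a threshold. A minimal sketch of computing such a training target is below; the function name, threshold, and toy magnitudes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Ideal binary mask over time-frequency magnitude spectrograms.

    A bin is kept (1.0) when its local speech-to-noise ratio exceeds
    the local criterion `lc_db` (in dB), otherwise discarded (0.0).
    """
    eps = 1e-12  # avoid log of zero
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy example: 2 time frames x 3 frequency bins.
speech = np.array([[1.0, 0.1, 0.5],
                   [0.2, 0.8, 0.05]])
noise = np.array([[0.5, 0.5, 0.5],
                  [0.5, 0.5, 0.5]])
mask = ideal_binary_mask(speech, noise)
# Bins where speech magnitude exceeds noise magnitude are retained.
```

A DNN trained with the CE loss then predicts this 0/1 target per bin from noisy (and, in the paper's setting, audio-visual) features.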

Cited by 2 publications (2 citation statements)
References 26 publications
“…Likewise, an alternative CASA method obtains speaker-consistent T-F segments and uses speaker models together with missing-data methods to group them into speech streams [14]. Websdale and Milner [15] employed unsupervised clustering to group speech components into two speaker groups by maximising the ratio of between- and within-cluster distances. Lekshmi and Sathidevi [2] proposed non-learning-based methods for single-channel speech separation exploiting the Short-Time Fourier Transform (STFT) [3].…”
Section: Related Work (mentioning)
confidence: 99%
“…This method was referred to as a deep Boltzmann machine (DBM). Websdale and Milner [15] suggested a technique based on a Recurrent Neural Network (RNN); using the noisy acoustic signal, the RNN can be employed for speech separation.…”
Section: Related Work (mentioning)
confidence: 99%