2016
DOI: 10.1007/978-3-319-49685-6_30

A Data Driven Approach to Audiovisual Speech Mapping

Abstract: The concept of using visual information as part of audio speech processing has been of significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics using only temporal visual information, without considering linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various configurations of MLP and datasets are used to identify optimal results, showing that given a sequence of pri…
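As a rough illustration of the mapping the abstract describes, the sketch below extracts 2D-DCT coefficients from a lip region and log filterbank energies from an audio frame, then fits an MLP to regress audio features from visual ones. The ROI size, coefficient counts, simplified filterbank, and network shape are all assumptions made for illustration, not the configurations reported in the paper.

```python
"""Minimal sketch of visual-to-audio feature mapping: 2D-DCT visual
features -> MLP -> log filterbank audio features. All sizes here are
illustrative assumptions, not the paper's reported settings."""
import numpy as np
from scipy.fftpack import dct
from sklearn.neural_network import MLPRegressor

def visual_features(lip_roi, n_coeffs=50):
    """2D-DCT of a grayscale lip ROI; keep low-order (top-left)
    coefficients, which carry most of the shape information."""
    d = dct(dct(lip_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    k = int(np.sqrt(n_coeffs)) + 1
    return d[:k, :k].ravel()[:n_coeffs]

def log_filterbank(frame, n_bands=23, eps=1e-8):
    """Crude log filterbank: log energy in equal-width bands of the
    power spectrum (a stand-in for a mel-scaled bank)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.log(np.array([b.sum() for b in bands]) + eps)

# Toy data: random arrays stand in for a time-aligned AV corpus.
rng = np.random.default_rng(0)
X = np.stack([visual_features(rng.random((32, 32))) for _ in range(200)])
y = np.stack([log_filterbank(rng.standard_normal(400)) for _ in range(200)])

mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
mlp.fit(X, y)                 # learn the visual -> audio mapping
est = mlp.predict(X[:1])      # estimated log filterbank frame
```

In practice the visual and audio frames would come from a time-aligned audiovisual corpus rather than random arrays, with the visual frame rate upsampled or the audio frames pooled so the two streams align.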

Cited by 8 publications (11 citation statements) | References 13 publications
“…In addition, to ensure good lip tracking, each sentence is manually validated by inspecting a few frames from each sentence. The aim of manual validation is to delete those sentences in which lip regions are not correctly identified [31]. Lip tracking optimisation lies outside the scope of the present work.…”
Section: Visual Feature Extraction
confidence: 99%
“…In addition, to ensure good lip tracking, each sentence is manually validated by inspecting a few frames from each sentence. The aim of manual validation is to delete those sentences in which lip regions are not correctly identified (Abel et al., 2016; Adeel et al., 2019b).…”
Section: Audio-visual Corpus and Feature Extraction
confidence: 99%
“…In contrast, not much work has been conducted to model lip reading as a regression problem for speech enhancement [27][28][29]. 2) A critical analysis of the proposed LSTM-based lip-reading regression model and its comparison with the conventional MLP-based regression model [31], where the LSTM model has shown a better capability to learn the correlation between lip movements and speech than conventional MLP models, particularly when different numbers of prior visual frames are considered. 3) Addressed limitations of the state-of-the-art VWF by presenting a novel EVWF.…”
Section: Introduction
confidence: 99%
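The contrast the statement above draws between MLP and LSTM regression can be sketched as follows: the MLP sees a window of prior visual frames as one flattened vector, while the LSTM consumes the same window as an ordered sequence and so models temporal structure explicitly. Window length, feature dimensions, and layer widths below are illustrative assumptions, not values from either cited model.

```python
"""Sketch contrasting the two regression models discussed above:
an MLP over a flattened window of prior visual frames vs. an LSTM
over the same window as a sequence. Sizes are assumed."""
import numpy as np
from tensorflow.keras import layers, models

N_PRIOR, VIS_DIM, AUD_DIM = 4, 50, 23   # assumed window/feature sizes

# MLP baseline: prior visual frames flattened into one vector.
mlp = models.Sequential([
    layers.Input(shape=(N_PRIOR * VIS_DIM,)),
    layers.Dense(100, activation='relu'),
    layers.Dense(AUD_DIM),               # linear output: log filterbank
])

# LSTM variant: the same frames presented as an ordered sequence.
lstm = models.Sequential([
    layers.Input(shape=(N_PRIOR, VIS_DIM)),
    layers.LSTM(100),
    layers.Dense(AUD_DIM),
])

for m in (mlp, lstm):
    m.compile(optimizer='adam', loss='mse')

# Toy data: 200 random windows of prior visual frames.
rng = np.random.default_rng(0)
seq = rng.random((200, N_PRIOR, VIS_DIM)).astype('float32')
tgt = rng.random((200, AUD_DIM)).astype('float32')
mlp.fit(seq.reshape(200, -1), tgt, epochs=2, verbose=0)
lstm.fit(seq, tgt, epochs=2, verbose=0)
```

The design difference is confined to the input treatment: both networks regress the same audio feature target, so any gain from the LSTM can be attributed to its modelling of the ordering across prior visual frames.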