Noisy situations cause huge problems for sufferers of hearing loss, as hearing aids often make the signal more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a causal, language-, noise- and speaker-independent AV deep neural network (DNN) architecture for speech enhancement (SE). The model exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. To evaluate the proposed SE framework, a first-of-its-kind AV binaural speech corpus, called ASPIRE, is recorded in real noisy environments, including a cafeteria and a restaurant. We demonstrate the superior performance of our approach, in terms of objective measures and subjective listening tests, over state-of-the-art SE approaches as well as recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of multi-language, large-vocabulary AV corpora covering a wide variety of noises is a major bottleneck to building robust language-, speaker- and noise-independent SE systems. We show that a model trained on synthetic mixtures of the Grid corpus (with 33 speakers and a small English vocabulary) and ChiME3 noises (consisting of only bus, pedestrian, cafeteria, and street noises) generalises well not only to large-vocabulary corpora but also to completely unrelated languages (such as Mandarin) and to a wide variety of speakers and noises.
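To make the described architecture concrete, below is a minimal sketch (in PyTorch) of a causal AV speech-enhancement network in the spirit of this abstract: noisy audio features and lip-region visual features are fused, and a unidirectional LSTM (no access to future frames, hence causal) predicts a time-frequency mask for the target speaker. The class name, feature dimensions, and masking objective are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CausalAVSENet(nn.Module):
    """Hypothetical causal AV speech-enhancement net: fuse audio + lip
    features, predict a [0, 1] time-frequency mask for the target speaker."""
    def __init__(self, audio_dim=257, visual_dim=128, hidden=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Unidirectional LSTM keeps the model causal (speaker/language agnostic
        # features go in; no future context is used).
        self.lstm = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, audio_dim), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):
        # noisy_spec: (batch, frames, audio_dim) log-magnitude spectra
        # lip_feats:  (batch, frames, visual_dim) lip-region embeddings
        fused = torch.cat([self.audio_proj(noisy_spec),
                           self.visual_proj(lip_feats)], dim=-1)
        out, _ = self.lstm(fused)
        return self.mask_head(out)  # mask applied to the noisy spectrum
```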
This paper proposes a novel lip-reading-driven deep learning framework for speech enhancement. The proposed approach leverages the complementary strengths of deep learning and analytical acoustic modelling (a filtering-based approach), in contrast to recently published, comparatively simpler benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep-learning-based lip-reading regression model is employed. In the second level, the lip-reading-approximated clean audio features are exploited by an enhanced, visually-derived Wiener filter (EVWF) to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate clean audio features using only temporal visual features (i.e. lip reading), considering different numbers of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The proposed EVWF is compared with conventional spectral subtraction (SS) and log-minimum mean-square error (LMMSE) methods using both ideal AV mapping and LSTM-driven AV mapping. The potential of the proposed speech enhancement framework is evaluated under four different dynamic, real-world, commercially motivated scenarios (cafe, street junction, public transport (bus), pedestrian area) at different SNR levels (ranging from low to high) using the benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality (PESQ) is used to evaluate the quality of the restored speech. For subjective testing, the standard mean opinion score (MOS) method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility. Ongoing work is aimed at enhancing the accuracy and generalization capability of the deep-learning-driven lip-reading model, and at contextual integration of AV cues, for context-aware autonomous AV speech enhancement.
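The second level rests on the classical Wiener gain, computed per filterbank channel with the clean-speech power supplied by the lip-reading model rather than estimated from the noisy audio. The NumPy sketch below illustrates that idea only; the variable names, noise-floor handling, and epsilon smoothing are assumptions for illustration, not the paper's exact EVWF formulation.

```python
import numpy as np

def evwf_gain(clean_power_est, noisy_power, eps=1e-8):
    """Per-band Wiener gain G = S / (S + N), where the clean-speech
    power S comes from the visual (lip-reading) regression model and
    the noise power N is inferred as the residual of the noisy input."""
    noise_power = np.maximum(noisy_power - clean_power_est, eps)
    return clean_power_est / (clean_power_est + noise_power + eps)

def enhance_frame(noisy_power, clean_power_est):
    """Apply the visually-derived gain to one noisy filterbank frame."""
    return evwf_gain(clean_power_est, noisy_power) * noisy_power
```

The key design point is that the gain never inspects the noisy audio to estimate the speech itself, which is why the filter remains usable even when the acoustic channel is severely corrupted.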
Data annotation is an important but time-consuming and costly procedure. To classify text into two classes, the very first requirement is a good annotation guideline establishing what qualifies for each class. In the literature, the difficulties associated with appropriate data annotation have been underestimated. In this paper, we present a novel approach to automatically construct an annotated sentiment corpus for the Algerian dialect (a Maghrebi Arabic dialect). The construction of this corpus is based on an Algerian sentiment lexicon that is also constructed automatically. The presented work deals with the two scripts widely used on Arabic social media: Arabic and Arabizi. The proposed approach automatically constructs a sentiment corpus containing 8,000 messages (4,000 in Arabic and 4,000 in Arabizi). The achieved F1-score is up to 72% and 78% for the Arabic and Arabizi test sets, respectively. Ongoing work is aimed at integrating a transliteration process for Arabizi messages to further improve the obtained results.
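A toy sketch of the lexicon-driven annotation idea follows: each message is scored by summing the polarities of its words in a sentiment lexicon, and only confidently scored messages are kept as pseudo-labelled corpus entries. The lexicon format, tokenisation, and confidence threshold are assumptions for illustration; the paper's actual construction procedure may differ.

```python
def annotate(messages, lexicon, threshold=1):
    """lexicon: dict mapping word -> polarity (+1 positive, -1 negative).
    Returns (message, label) pairs for messages whose aggregate
    polarity is decisive enough; ambiguous messages are skipped."""
    corpus = []
    for msg in messages:
        score = sum(lexicon.get(w, 0) for w in msg.lower().split())
        if abs(score) >= threshold:  # drop near-neutral, ambiguous messages
            corpus.append((msg, "pos" if score > 0 else "neg"))
    return corpus
```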
Human speech processing is inherently multimodal: visual cues (lip movements) help us better understand speech in noise. Lip-reading-driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, at high SNRs, or low levels of background noise, visual cues become less effective for speech enhancement, and audio-only cues work well enough. Therefore, a more optimal, context-aware audio-visual (AV) system is required that contextually utilises both visual and noisy audio features and effectively accounts for different noisy conditions. In this paper, we introduce a novel contextual AV switching component that contextually exploits AV cues with respect to different operating conditions to estimate clean audio, without requiring any SNR estimation. The switching module switches between visual-only (V-only), audio-only (A-only), and combined audio-visual cues at low, high, and moderate SNR levels, respectively. The contextual AV switching component is developed by integrating a convolutional neural network (CNN) and a long short-term memory (LSTM) network. For testing, the estimated clean audio features are utilised by the developed novel enhanced visually-derived Wiener filter (EVWF) for clean audio power spectrum estimation. The contextual AV speech enhancement method is evaluated under dynamic real-world scenarios (cafe, street, bus, pedestrian) at different SNR levels (ranging from low to high) using the benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality (PESQ) is used to evaluate the quality of the restored speech. For subjective testing, the standard mean opinion score (MOS) method is used. The critical analysis and comparative study demonstrate that the proposed contextual AV approach outperforms A-only, V-only, spectral subtraction (SS), and log-minimum mean square error (LMMSE) based speech enhancement methods at both low and high SNRs, revealing its capability to tackle spectro-temporal variation in any real-world noisy condition. Simulation results also validate the phenomena of less effective visual cues at high SNRs, less effective audio cues at low SNRs, and the complementary strengths of audio and visual cues. Lastly, the benefit of using visual cues at low SNRs is demonstrated using colour spectrograms, where visual cues recover speech components at specific time-frequency units better than A-only cues.
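To illustrate the switching idea, here is a hypothetical PyTorch sketch of a small CNN+LSTM classifier that inspects the noisy audio stream and selects which cue combination to trust, without an explicit SNR estimate. The layer sizes, three-way softmax head, and regime labels are assumptions; the paper's actual switching network may be configured differently.

```python
import torch
import torch.nn as nn

class AVSwitch(nn.Module):
    """Hypothetical contextual switch: per utterance, output probabilities
    for {V-only, AV, A-only}, corresponding to low / moderate / high SNR
    operating regimes, learned without any explicit SNR estimation."""
    def __init__(self, feat_dim=257, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, noisy_spec):
        # noisy_spec: (batch, frames, feat_dim) noisy log-magnitude spectra
        h = self.cnn(noisy_spec.transpose(1, 2)).transpose(1, 2)
        _, (last, _) = self.lstm(h)           # final hidden state summarises context
        return self.head(last[-1]).softmax(dim=-1)
```

The selected branch then feeds the EVWF: V-only estimates under heavy noise, A-only estimates in clean conditions, and fused AV estimates in between.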
Extreme events and disasters resulting from climate change or other ecological factors are difficult to predict and manage. Current limitations of state-of-the-art approaches to disaster prediction and management could be addressed by adopting new, unorthodox risk assessment and management strategies. Next-generation Internet of Things (IoT), Wireless Sensor Networks (WSNs), 5G wireless communication, and big data analytics technologies are the key enablers of future effective disaster management infrastructures. In this chapter, we present a survey of emerging wireless communication technologies with potential for enhancing disaster prediction, monitoring, and management systems. Challenges, opportunities, and future research trends are highlighted to provide insight into potential future work for researchers in this field.