Dual Microphone Voice Activity Detection Based on Reliable Spatial Cues

Hwang, Soojoong; Jin, Yu; Shin, Jong Won

doi:10.3390/s19143056

Cited by 6 publications

(2 citation statements)

References 30 publications

“…A major flaw in the language-based interface is that it is very sensitive to ambient noise, making it difficult to differentiate and classify signal to noise. To address this, a microphone voice activity detection (VAD) scheme [55] that enhances performance in a variety of noise environments in consideration of the sparsity of the speech signal in the time-frequency domain is proposed. And the language-based interface with the robot’s control is developed [56] and it reduced the ambient noise by 30%, the resulting inaccuracy has been improved.…”

Section: Biosignal-based Speech Recognition (mentioning)

confidence: 99%

Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review

Lee, Seong, Ozlu, et al. 2021

Sensors

Voice is one of the essential mechanisms for communicating and expressing one’s intentions as a human being. There are several causes of voice inability, including disease, accident, vocal abuse, medical surgery, ageing, and environmental pollution, and the risk of voice loss continues to increase. Novel approaches should be developed for speech recognition and production because voice loss seriously undermines quality of life and sometimes leads to isolation from society. In this review, we survey mouth interface technologies, which are mouth-mounted devices for speech recognition, production, and volitional control, and the corresponding research to develop artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images, and 3-axial magnetic sensors, especially with deep learning techniques. We examine various deep learning technologies related to voice recognition, including visual speech recognition and silent speech interfaces, analyze their flow, and systematize them into a taxonomy. Finally, we discuss methods to solve the communication problems of people with speech disabilities and future research with respect to deep learning components.

“…Spatial cues between multi-channel signals such as inter-channel time difference (or inter-channel phase difference) and inter-channel level difference can indicate the location of the speech source. These spatial characteristics have been shown to be particularly beneficial when combined with spectral characteristics over the frequency domain in several fields, such as source separation, speech enhancement, and voice activity detection [16, 17, 18, 19, 20, 21, 22, 23]. Unfortunately, these spatial features are typically extracted in the frequency domain using STFT, making it difficult to integrate perfectly using the time domain method.…”

Section: Proposed Multi-channel Cross-tower With Attention Mechani (mentioning)

confidence: 99%

Multi-TALK: Multi-Microphone Cross-Tower Network for Jointly Suppressing Acoustic Echo and Background Noise

Park, Chang 2020

Sensors

In this paper, we propose a multi-channel cross-tower with attention mechanisms in latent domain network (Multi-TALK) that suppresses both the acoustic echo and background noise. The proposed approach consists of the cross-tower network, a parallel encoder with an auxiliary encoder, and a decoder. For the multi-channel processing, a parallel encoder is used to extract latent features of each microphone, and the latent features including the spatial information are compressed by a 1D convolution operation. In addition, the latent features of the far-end are extracted by the auxiliary encoder, and they are effectively provided to the cross-tower network by using the attention mechanism. The cross tower network iteratively estimates the latent features of acoustic echo and background noise in each tower. To improve the performance at each iteration, the outputs of each tower are transmitted as the input for the next iteration of the neighboring tower. Before passing through the decoder, to estimate the near-end speech, attention mechanisms are further applied to remove the estimated acoustic echo and background noise from the compressed mixture to prevent speech distortion by over-suppression. Compared to the conventional algorithms, the proposed algorithm effectively suppresses the acoustic echo and background noise and significantly lowers the speech distortion.
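
The citation statement for this entry refers to inter-channel phase difference and inter-channel level difference extracted in the frequency domain using STFT. As a rough illustration of what those two cues are, the sketch below computes them per time-frequency bin for a two-microphone signal. It uses only generic textbook definitions and is not the method of the indexed paper or of Multi-TALK; the function names `stft` and `spatial_cues`, the frame length, the hop size, and the `eps` floor are all assumptions chosen for the example.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Minimal STFT (hypothetical helper): Hann-windowed frames, then a real FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Result shape: (n_frames, frame_len // 2 + 1)
    return np.fft.rfft(frames, axis=1)


def spatial_cues(x1, x2, frame_len=512, hop=256, eps=1e-12):
    """Return (ipd, ild) per time-frequency bin for two microphone signals.

    ipd: inter-channel phase difference in radians, wrapped by np.angle.
    ild: inter-channel level difference in dB (channel 1 relative to channel 2).
    eps is an assumed small floor that avoids division by zero and log(0).
    """
    X1 = stft(x1, frame_len, hop)
    X2 = stft(x2, frame_len, hop)
    ipd = np.angle(X1 * np.conj(X2))
    ild = 10.0 * np.log10((np.abs(X1) ** 2 + eps) / (np.abs(X2) ** 2 + eps))
    return ipd, ild


if __name__ == "__main__":
    # Synthetic example: channel 2 is a delayed, attenuated copy of channel 1.
    rng = np.random.default_rng(0)
    source = rng.standard_normal(16000)
    mic1 = source
    mic2 = 0.5 * np.roll(source, 2)  # 2-sample delay, half amplitude
    ipd, ild = spatial_cues(mic1, mic2)
    print("cue shapes:", ipd.shape, ild.shape)
    print("median ILD (dB):", np.median(ild))
```

In this toy setup the level difference reflects the attenuation between the channels and the phase difference grows with frequency in proportion to the delay, which is why such cues can indicate the direction of a source. A real system would also have to handle noise, reverberation, and phase wrapping at higher frequencies.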

Speech protected noise cancellation system in noise dominated environments

Usta, Doǧan 2022

Applied Acoustics
