DOANet: a deep dilated convolutional neural network approach for search and rescue with drone-embedded sound source localization

Qayyum, Alif Bin Abdul; Hassan, Kanza; Anika, Adrita; Shadiq, Md. Farhan; Rahman, M. Sohel; Islam, Md. Tariqul; Imran, Sheikh Asif; Hossain, Shahruk; Haque, Mohammad Ariful

doi:10.1186/s13636-020-00184-2

Cited by 7 publications

(4 citation statements)

References 23 publications

“…The input STFT is calculated using a window length of 2048 and a half overlap. The 10 dilated layers aggregate information across the frequency dimension: they have kernel size (3, 1) and dilation factor 2^(d − 1), where d ∈ [1, 10] denotes the depth of the layer. The 3 non-dilated layers aggregate information across both time and frequency dimensions, with a kernel size (3, 3).…”

Section: A Smolnet (mentioning)

confidence: 99%
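The layer description in the quoted statement can be checked with a little arithmetic. The sketch below is not code from the citing paper: it assumes the dilation factor reads 2^(d − 1), stride-1 convolutions, and a one-sided STFT, and the function names are hypothetical. Under those assumptions the ten dilated layers see 2047 frequency bins, which is enough to span all 1025 bins of a 2048-point window.

```python
# Sketch only: assumes dilation 2**(d - 1), stride 1, and a one-sided STFT.
KERNEL = 3         # kernel size along frequency, from the quoted (3, 1)
NUM_DILATED = 10   # depth d ranges over 1..10
WINDOW = 2048      # STFT window length from the quote


def dilations(num_layers):
    """Dilation factor of each layer, d = 1..num_layers."""
    return [2 ** (d - 1) for d in range(1, num_layers + 1)]


def receptive_field(kernel, dilation_list):
    """Receptive field of stacked stride-1 convolutions along one axis."""
    field = 1
    for dil in dilation_list:
        # Each layer widens the field by (kernel - 1) * dilation.
        field += (kernel - 1) * dil
    return field


freq_bins = WINDOW // 2 + 1   # bins of a one-sided spectrum (assumption)
hop = WINDOW // 2             # "half overlap" read as a hop of half a window
rf = receptive_field(KERNEL, dilations(NUM_DILATED))
print(freq_bins, hop, rf)     # 1025 1024 2047
```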

Deep Learning Models for Single-Channel Speech Enhancement on Drones

et al. 2023

Speech enhancement for drone audition is made challenging by the strong ego-noise from the rotating motors and propellers, which leads to extremely low signal-to-noise ratios (e.g. SNR < -15 dB) at onboard microphones. In this paper, we extensively assess the ability of single-channel deep learning approaches to ego-noise reduction on drones. We train twelve representative deep neural network (DNN) models, covering three operation domains (time-frequency magnitude domain, time-frequency complex domain and end-to-end time domain) and three distinct architectures (sequential, encoder-decoder and generative). We critically discuss and compare the performance of these models in extremely low-SNR scenarios, ranging from -30 to 0 dB. We show that time-frequency complex domain and UNet encoder-decoder architectures outperform other approaches on speech enhancement measures while providing a good trade-off with other criteria, such as model size, computation complexity and context length. Specifically, the best-performing model is DCUNet, a UNet model operating in the time-frequency complex domain, which, at input SNR -15 dB, improves ESTOI from 0.1 to 0.4, PESQ from 1.0 to 1.9 and SI-SDR from -15 dB to 3.7 dB. Based on the insights drawn from these findings, we discuss future research in drone ego-noise reduction.

“…(22) The bases now hold N/4 + 1 elements (instead of N/2 + 1) and all the elements of p(k) are purely real or imaginary numbers. Computing (22) involves K(N/2 + 2) real multiplications and KN/2 real additions, for a total of K(N + 2) flops. Computing the vectors x_add(t) and x_sub(t) also adds N flops, which leads to a total of K(N + 2) + N flops.…”

Section: Generalized Cross-correlation (mentioning)

confidence: 99%
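The flop count in the quoted statement adds up as follows. This is a small bookkeeping sketch, not code from the cited preprint: `K` and `N` are the symbols used in the quote, the function name is hypothetical, and the sample values are arbitrary and chosen only for illustration.

```python
def quoted_flop_count(K, N):
    """Reproduce the flop tally from the quoted statement (N assumed even)."""
    multiplications = K * (N // 2 + 2)   # K(N/2 + 2) real multiplications
    additions = K * N // 2               # KN/2 real additions
    eq22 = multiplications + additions   # K(N + 2) flops for equation (22)
    vectors = N                          # N flops for x_add(t) and x_sub(t)
    return eq22 + vectors                # K(N + 2) + N in total


# Arbitrary illustrative values, not taken from the paper.
print(quoted_flop_count(K=4, N=512))  # 2568
```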

“…DoA can also solve the permutation ambiguity in speech separation tasks [12] with multiple microphones, in deep clustering for instance [13], [14], [15] or time-frequency masking [16], [17], [18]. SSL can also serve numerous applications in robotics [19], ranging from acoustic synchronous localization and mapping (SLAM) [20], rescue missions [21], [22], drones tracking [23], [24], [25] and assisting humans with hearing impairments [26]. Some frameworks have been proposed over the years to perform online SSL on robots [27], [28].…”

Section: Introduction (mentioning)

confidence: 99%

Fast Cross-Correlation for TDoA Estimation on Small Aperture Microphone Arrays

Grondin, Marc-Antoine, Lauzon et al. 2022

Preprint

This paper introduces the Fast Cross-Correlation (FCC) method for Time Difference of Arrival (TDoA) estimation for pairs of microphones on a small aperture microphone array. FCC relies on low-rank decomposition and exploits symmetry in even and odd bases to speed up computation while preserving TDoA accuracy. FCC reduces the number of flops by a factor of 4.5 and the execution time by factors of 8.2, 2.6 and 2.7 on a Raspberry Pi Zero, a Raspberry Pi 4 and an Nvidia Jetson TX2, respectively, compared to the state-of-the-art Generalized Cross-Correlation (GCC) method that relies on the Fast Fourier Transform (FFT). This improvement can provide portable microphone arrays with extended battery life and allow real-time processing on low-cost hardware.
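For context, the frequency-domain GCC baseline that the abstract compares against can be sketched as below. This is not the FCC method and not the authors' code: it is a minimal illustration of GCC with PHAT weighting (the weighting is an assumption, since the abstract does not name one), using a naive pure-Python DFT so the example has no dependencies. Practical code would use an FFT library instead. All function names are hypothetical.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1, x2, max_lag):
    """Delay of x1 relative to x2, in samples, via GCC with PHAT weighting."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    # Cross-spectrum, normalized to unit magnitude (PHAT weighting).
    cross = [a * b.conjugate() for a, b in zip(X1, X2)]
    weighted = [c / (abs(c) + 1e-12) for c in cross]
    # Back to the lag domain: a circular cross-correlation.
    cc = [v.real for v in dft(weighted, inverse=True)]
    # Search lags -max_lag..max_lag; negative lags wrap around.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n])


# Synthetic check: a noise frame and a copy circularly delayed by 3 samples.
rng = random.Random(0)
s = [rng.uniform(-1.0, 1.0) for _ in range(64)]
delayed = [s[(t - 3) % 64] for t in range(64)]
estimated = gcc_phat_delay(delayed, s, max_lag=8)
print(estimated)  # 3
```

Dividing the delay in samples by the sampling rate gives the TDoA in seconds.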

“…DNN sound enhancement is typically a pre-processing step for traditional source localization algorithms [36]. While a DNN can also be trained to predict the location of the sound source directly from the multi-channel microphone signal, the performance typically drops significantly in low-SNR scenarios [37].…”

Section: Introduction (mentioning)

confidence: 99%

Deep-Learning-Assisted Sound Source Localization From a Flying Drone

Wang, Cavallaro 2022

IEEE Sensors J.

Sound source localization from a flying drone is a challenging task due to the strong ego-noise from rotating motors and propellers as well as the movement of the drone and the sound sources. To address this challenge, we propose a deep learning-based framework that integrates single-channel noise reduction and multi-channel source localization. In this framework, we suppress the ego-noise and estimate a time-frequency soft ratio mask with a single-channel deep neural network (DNN). Then we design two downstream multi-channel source localization algorithms, based on Steered Response Power (SRP-DNN) and Time-Frequency Spatial filtering (TFS-DNN). The main novelty lies in the proposed TFS-DNN approach, which estimates the presence probability of the target sound at individual time-frequency bins by combining the DNN-inferred soft ratio mask and the instantaneous direction of arrival of the sound received by the microphone array. The time-frequency presence probability of the target sound is then used to design a set of spatial filters to construct a spatial likelihood map for source localization. By jointly exploiting spectral and spatial information, TFS-DNN robustly processes signals in short segments (e.g. 0.5 seconds) in dynamic and low signal-to-noise-ratio scenarios (e.g. SNR -20 dB). Results on real and simulated data in a variety of scenarios (static sources, moving sources and moving drones) indicate the advantage of TFS-DNN over competing methods, including SRP-DNN and the state-of-the-art time-frequency spatial filtering.
