Transportation and locomotion mode recognition from multimodal smartphone sensors is useful for providing just-in-time context-aware assistance. However, the field is currently held back by the lack of standardized datasets, recognition tasks, and evaluation criteria. Recognition methods are often tested on ad hoc datasets acquired for one-off recognition problems and with differing choices of sensors, which prevents a systematic comparative evaluation of methods within and across research groups. Our goal is to address these issues by: 1) introducing a publicly available, large-scale dataset for transportation and locomotion mode recognition from multimodal smartphone sensors; 2) suggesting 12 reference recognition scenarios, which are a superset of the tasks identified in the related work; 3) suggesting relevant combinations of sensors to use based on energy considerations among the accelerometer, gyroscope, magnetometer, and global positioning system (GPS) sensors; and 4) defining precise evaluation criteria, including training and testing sets, evaluation measures, and user-independent and sensor-placement-independent evaluations. Based on this, we report a systematic study of the relevance of statistical and frequency features, selected using information-theoretic criteria, to inform recognition systems. We then systematically report the reference performance obtained on all the identified recognition scenarios using a machine-learning recognition pipeline. The extent of this analysis and the clear definition of the recognition tasks enable future researchers to evaluate their own methods in a comparable manner, thus contributing to further advances in the field. The dataset and the code are available online. (This work was supported by Huawei Technologies through the project "Activity Sensing Technologies for Mobile Users.")
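To illustrate the kind of feature extraction such a recognition pipeline relies on, the sketch below computes a few common statistical and frequency-domain features over one fixed-length window of sensor samples. The window length, sampling rate, and feature set are illustrative assumptions, not the exact choices made in the paper.

```python
import numpy as np

def extract_features(window, fs=100.0):
    """Statistical and frequency features for one sensor window.

    window: 1-D array of samples (e.g., accelerometer magnitude);
    fs: sampling rate in Hz. Both are illustrative assumptions.
    """
    feats = {
        # Statistical (time-domain) features
        "mean": np.mean(window),
        "std": np.std(window),
        "min": np.min(window),
        "max": np.max(window),
        "energy": np.sum(window ** 2) / len(window),
    }
    # Frequency features from the magnitude spectrum of the
    # mean-removed window
    spectrum = np.abs(np.fft.rfft(window - np.mean(window)))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    feats["dominant_freq"] = freqs[np.argmax(spectrum)]
    psd = spectrum ** 2
    p = psd / (np.sum(psd) + 1e-12)  # normalized power distribution
    feats["spectral_entropy"] = -np.sum(p * np.log2(p + 1e-12))
    return feats

# Example: one 5-second window of synthetic data at 100 Hz
window = np.random.randn(500)
print(extract_features(window))
```

In practice, such features would be computed per sensor channel and window, ranked by an information-theoretic criterion such as mutual information with the class label, and fed to a classifier.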
When a micro aerial vehicle (MAV) captures sounds emitted by a ground or aerial source, its motors and propellers are much closer to the microphone(s) than the sound source, leading to extremely low signal-to-noise ratios (SNR), e.g., -15 dB. While microphone-array techniques have been investigated intensively, their application to MAV-based ego-noise reduction has rarely been reported in the literature. To fill this gap, we implement and compare three types of microphone-array algorithms to enhance the target sound captured by an MAV: a newly emerged technique, time-frequency spatial filtering, and two well-known techniques, beamforming and blind source separation. In particular, based on the observation that the target sound and the ego-noise usually have energy concentrated at sparsely isolated time-frequency bins, we propose to use the time-frequency processing approach, which formulates a spatial filter that enhances a target direction based on local direction-of-arrival (DOA) estimates at individual time-frequency bins. By exploiting the time-frequency sparsity of the acoustic signal, this spatial filter works robustly for sound enhancement in the presence of strong ego-noise. We analyze the three techniques in detail and conduct a comparative evaluation with real recorded MAV sounds. Experimental results show the superiority of blind source separation and time-frequency filtering in low-SNR scenarios.
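To make the idea of time-frequency spatial filtering concrete, the following sketch builds a binary time-frequency mask for a two-microphone array: for each bin it estimates a local DOA from the inter-channel phase difference and retains only the bins whose estimate falls near the known target direction. The array geometry, STFT settings, and angular tolerance are assumptions for illustration; the filter studied in the paper is more elaborate.

```python
import numpy as np
from scipy.signal import stft, istft

def tf_spatial_filter(x1, x2, fs, target_doa_deg, mic_dist=0.05,
                      c=343.0, tol_deg=15.0, nperseg=512):
    """Binary TF mask from local DOA estimates (two-microphone sketch).

    x1, x2: time-domain signals from two microphones;
    target_doa_deg: known target direction (0 deg = broadside);
    mic_dist, tol_deg, nperseg: illustrative assumptions.
    """
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

    # Inter-channel phase difference -> local DOA estimate per TF bin:
    # phase_diff = 2*pi*f * (mic_dist * sin(theta) / c)
    phase_diff = np.angle(X1 * np.conj(X2))
    f_safe = np.maximum(f[:, None], 1.0)  # avoid division by 0 Hz
    sin_theta = np.clip(phase_diff * c / (2 * np.pi * f_safe * mic_dist),
                        -1.0, 1.0)
    doa = np.degrees(np.arcsin(sin_theta))

    # Keep only bins whose local DOA is close to the target direction
    mask = np.abs(doa - target_doa_deg) < tol_deg
    _, y = istft(X1 * mask, fs=fs, nperseg=nperseg)
    return y
```

The mask exploits exactly the sparsity assumption in the abstract: if the target and the ego-noise rarely dominate the same bin, discarding off-direction bins removes mostly noise energy.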
This article fills the gap between the growing interest in signal processing based on deep neural networks (DNNs) and the new application of enhancing speech captured by microphones on a drone. In this context, the quality of the target sound is degraded significantly by the strong ego-noise from the rotating motors and propellers. We present the first work that integrates single-channel and multi-channel DNN-based approaches for speech enhancement on drones. We employ a DNN to estimate the ideal ratio masks at individual time-frequency bins, which are subsequently used to design three potential speech enhancement systems: single-channel ego-noise reduction (DNN-S), multi-channel beamforming (DNN-BF), and multi-channel time-frequency spatial filtering (DNN-TF). The main novelty lies in the proposed DNN-TF algorithm, which infers the noise-dominance probabilities at individual time-frequency bins from the DNN-estimated soft masks and then incorporates them into a time-frequency spatial filtering framework for ego-noise reduction. By jointly exploiting the direction of arrival of the target sound, the time-frequency sparsity of the acoustic signals (speech and ego-noise), and the time-frequency noise-dominance probability, DNN-TF can suppress the ego-noise effectively in scenarios with very low signal-to-noise ratios (e.g., SNR below -15 dB), especially when the direction of the target sound is close to that of a source of the ego-noise. Experiments with real and simulated data show the advantage of DNN-TF over competing methods, including DNN-S, DNN-BF, and state-of-the-art time-frequency spatial filtering.
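The quantity shared by all three systems is the ideal ratio mask (IRM). The sketch below shows, under common assumptions, how an IRM training target is computed from parallel clean-speech and ego-noise spectrograms, how a DNN-estimated soft mask would be applied for single-channel enhancement (the DNN-S case), and one simple proxy for deriving a per-bin noise-dominance probability from the soft mask. The exact mask definition, network architecture, and probability model used in the paper are not specified here and are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(S, N):
    """IRM training target from clean-speech (S) and ego-noise (N)
    spectrograms, using the common definition
    IRM = sqrt(|S|^2 / (|S|^2 + |N|^2)); the paper may use a
    different variant (assumption)."""
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return np.sqrt(ps / (ps + pn + 1e-12))

def enhance_single_channel(y, mask, fs, nperseg=512):
    """DNN-S style enhancement: apply a (DNN-estimated) soft mask to
    the noisy spectrogram and resynthesize the waveform."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    _, s_hat = istft(Y * mask, fs=fs, nperseg=nperseg)
    return s_hat

def noise_dominance(mask):
    """A bin is noise-dominated when the speech mask is small;
    1 - mask is one simple proxy for that probability (assumption)."""
    return 1.0 - mask
```

In the DNN-TF system described above, such per-bin noise-dominance probabilities are what down-weight unreliable local DOA estimates inside the time-frequency spatial filtering framework.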