This paper describes a novel approach for localizing multiple acoustic sources that overlap in time. The proposed algorithm relies on acoustic maps computed in multi-microphone settings, which describe the distribution of acoustic activity over a monitored area. Through appropriate processing of these maps, the positions of two or more simultaneously active acoustic sources can be estimated robustly. Experimental results obtained on real data collected for this specific task show the capabilities of the proposed method both with distributed microphone networks and with compact arrays.
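The abstract does not detail how the acoustic maps are computed. As a point of reference, a GCF-style coherence map can be built by summing pairwise GCC-PHAT correlations at the delays implied by each candidate point; the sketch below illustrates that idea (the function names, grid of candidate points, and padding scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gcc_phat(x, y):
    """GCC-PHAT cross-correlation between two equal-length mic signals."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    return np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # zero lag at n//2

def acoustic_map(signals, mic_pos, grid, fs, c=343.0):
    """Coherence map: for each grid point, accumulate the GCC-PHAT value
    at the inter-microphone delay that point would produce."""
    n = 2 * len(signals[0])
    gmap = np.zeros(len(grid))
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            cc = gcc_phat(signals[i], signals[j])
            for k, p in enumerate(grid):
                tau = (np.linalg.norm(p - mic_pos[i]) -
                       np.linalg.norm(p - mic_pos[j])) / c
                lag = int(round(tau * fs)) + n // 2
                if 0 <= lag < len(cc):
                    gmap[k] += cc[lag]
    return gmap  # peaks indicate likely source positions
```

Peaks of such a map mark candidate source positions; handling multiple simultaneous sources then amounts to analyzing several peaks rather than only the global maximum.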
We propose an audiovisual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audiovisual cues from the individual modalities, we fuse them adaptively, according to their reliability, in a particle filter framework. The reliability of the audio signal is measured from the maximum peak value of the Global Coherence Field (GCF) at each frame. The visual reliability is based on colour-histogram matching between the detection results and a reference image in the RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audiovisual tracker outperforms both the individual modalities and a classical approach with fixed fusion parameters in terms of tracking accuracy.
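The exact fusion rule is not given in the abstract. A minimal sketch of one plausible adaptive scheme, assuming a geometric combination of per-particle likelihoods weighted by the two per-frame reliabilities (the floor values and mixing formula are assumptions, not the paper's method):

```python
import numpy as np

def fused_particle_weights(audio_lik, video_lik, gcf_peak, hist_sim,
                           gcf_floor=0.1, hist_floor=0.1):
    """Hypothetical adaptive fusion: per-frame reliabilities rescale the
    contribution of each modality to the particle weights.
    audio_lik, video_lik: per-particle likelihood arrays from each modality.
    gcf_peak: maximum GCF value this frame (audio reliability proxy).
    hist_sim: colour-histogram similarity to the reference (visual proxy)."""
    r_a = max(gcf_peak, gcf_floor)
    r_v = max(hist_sim, hist_floor)
    alpha = r_a / (r_a + r_v)                   # adaptive mixing weight
    w = audio_lik ** alpha * video_lik ** (1.0 - alpha)
    return w / w.sum()                          # normalised particle weights
```

With fixed parameters, alpha would be a constant; letting it track the per-frame reliabilities is what makes the fusion adaptive.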
An interface for distant-talking control of home devices requires the ability to identify the positions of multiple users. Acoustic maps, based either on the Global Coherence Field (GCF) or on the Oriented Global Coherence Field (OGCF), have already been exploited successfully to determine the position and head orientation of a single speaker. This paper proposes a new method that uses acoustic maps to deal with the case of two simultaneous speakers. The method is based on a two-step analysis of a coherence map: first, the dominant speaker is localized; then the map is modified by compensating for the effects due to the first speaker, and the position of the second speaker is detected. Simulations were carried out to show how an appropriate analysis of OGCF and GCF maps allows both speakers to be localized. Experiments proved the effectiveness of the proposed solution in a linear microphone array setup.
Index Terms: microphone array, speaker localization, multiple speakers, global coherence field.
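The paper's compensation step models the coherence pattern contributed by the dominant speaker; the sketch below replaces that model with a simple suppression region around the first peak, which is a simplifying assumption, to make the two-step structure concrete:

```python
import numpy as np

def localize_two_speakers(gcf_map, grid, suppress_radius=0.5):
    """Two-step analysis sketch on a coherence map.
    gcf_map: (N,) map values; grid: (N, d) candidate positions.
    Step 1: take the global peak as the dominant speaker.
    Step 2: compensate for its contribution (here crudely, by zeroing the
    map within suppress_radius metres of the first peak), then take the
    new maximum as the second speaker."""
    k1 = int(np.argmax(gcf_map))
    p1 = grid[k1]
    compensated = gcf_map.copy()
    dist = np.linalg.norm(grid - p1, axis=1)
    compensated[dist < suppress_radius] = 0.0   # crude compensation
    k2 = int(np.argmax(compensated))
    return p1, grid[k2]
```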
Outdoor acoustic event detection is an exciting research field, but it requires complex algorithms and deep learning techniques that typically demand substantial computational, memory, and energy resources. These demands discourage IoT implementations, where resources must be used efficiently. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing energy efficiency. This paper addresses the application of sound event detection at the very edge, by optimizing deep learning techniques on resource-constrained embedded platforms for the IoT. The contribution is two-fold: first, a two-stage student-teacher approach is presented to make state-of-the-art neural networks for sound event detection fit on current microcontrollers; second, we test our approach on an ARM Cortex-M4, focusing in particular on issues related to 8-bit quantization. Our embedded implementation achieves 68% recognition accuracy on UrbanSound8K, not far from state-of-the-art performance, with an inference time of 125 ms per second of audio stream, a power consumption of 5.5 mW, and a memory footprint of just 34.3 kB of RAM.
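The abstract does not spell out the distillation objective. For context, a standard student-teacher (knowledge distillation) loss in PyTorch looks as follows; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=3.0, alpha=0.5):
    """Standard knowledge-distillation objective: a hard-label
    cross-entropy term plus a soft-target KL term at temperature T
    (scaled by T*T to keep gradient magnitudes comparable)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

After training, the small student network is the candidate for 8-bit quantization and deployment on the microcontroller.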