Outdoor acoustic event detection is an exciting research field but challenged by the need for complex algorithms and deep learning techniques, typically requiring many computational, memory, and energy resources. These challenges discourage IoT implementations, where an efficient use of resources is required. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing energy efficiency. This paper addresses the application of sound event detection at the very edge, by optimizing deep learning techniques on resource-constrained embedded platforms for the IoT. The contribution is two-fold: firstly, a two-stage student-teacher approach is presented to make state-of-theart neural networks for sound event detection fit on current microcontrollers; secondly, we test our approach on an ARM Cortex M4, particularly focusing on issues related to 8-bits quantization. Our embedded implementation can achieve 68% accuracy in recognition on Urbansound8k, not far from state-ofthe-art performance, with an inference time of 125 ms for each second of the audio stream, and power consumption of 5.5 mW in just 34.3 kB of RAM.
In most classification tasks, wide and deep neural networks perform and generalize better than their smaller counterparts, in particular when they are exposed to large and heterogeneous training sets. However, in the emerging field of Internet of Things memory footprint and energy budget pose severe limits on the size and complexity of the neural models that can be implemented on embedded devices. The Student-Teacher approach is an attractive strategy to distill knowledge from a large network into smaller ones, that can fit on low-energy lowcomplexity embedded IoT platforms. In this paper, we consider the outdoor sound event detection task as a use case. Building upon the VGGish network, we investigate different distillation strategies to substantially reduce the classifier's size and computational cost with minimal performance losses. Experiments on the UrbanSound8K dataset show that extreme compression factors (up to 4.2 • 10 −4 for parameters and 1.2 • 10 −3 for operations with respect to VGGish) can be achieved, limiting the accuracy degradation from 75% to 70%. Finally, we compare different embedded platforms to analyze the trade-off between available resources and achievable accuracy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.