Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation

Vecchiotti, Paolo; Pepe, Giovanni; Principi, Emanuele; Squartini, Stefano

doi:10.1016/j.eswa.2019.05.017

Cited by 11 publications

(8 citation statements)

References 35 publications


“…In addition, we explore the use of spatial features to aid VAD+OSD and speaker counting. As mentioned above, a number of works have shown that spatial features can be used for counting (Drude et al, 2014;Pasha et al, 2017;Brutti et al, 2010;Pavlidi et al, 2012) and VAD (Vecchiotti et al, 2019b). However, to our knowledge, no study has yet been performed where spatial features are used in conjunction with deep neural networks to tackle OSD and speaker counting directly.…”

Section: Our Contribution (mentioning)

confidence: 99%

Overlapped Speech Detection and speaker counting using distant microphone arrays

Cornell

Omologo

Squartini

2022

Computer Speech & Language

Self Cite

We study the problem of detecting and counting simultaneous, overlapping speakers in a multichannel, distant-microphone scenario. Focusing on a supervised learning approach, we treat Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD and OSD (VAD+OSD) and speaker counting in a unified way, as instances of a general Overlapped Speech Detection and Counting (OSDC) multi-class supervised learning problem. We consider a Temporal Convolutional Network (TCN) and a Transformer-based architecture for this task, and compare them with previously proposed state-of-the-art methods based on Recurrent Neural Networks (RNN) or hybrid Convolutional-Recurrent Neural Networks (CRNN). In addition, we propose ways of exploiting multichannel input by means of early or late fusion of single-channel features with spatial features extracted from one or more microphone pairs. We conduct an extensive experimental evaluation on the AMI and CHiME-6 datasets and on a purposely made multichannel synthetic dataset. We show that the Transformer-based architecture performs best among all architectures and that neural-network-based spatial localization features outperform signal-based spatial features and significantly improve performance compared to single-channel features only. Finally, we find that training with a speaker counting objective improves OSD compared to training with a VAD+OSD objective.


“…Both techniques are widely used thanks to their high accuracy and relatively low processing cost. For instance, the authors in [24], [26], [27], [44], [49], and [51] use CNN and RNN to estimate the pedestrian localization, improving the fingerprint creation process, using different data-training, and reducing the noise effects. The best result shows an improvement on accuracy by 75% when compared to the pedestrian localization system without the use of ML.…”

Section: A Machine Learning In Scene Analysis (mentioning)

confidence: 99%

A Survey of Machine Learning in Pedestrian Localization Systems: Applications, Open Issues and Challenges

et al. 2021


With the popularization of machine learning (ML) techniques and the increased chipset's performance, the application of ML to pedestrian localization systems has received significant attention in the last years. Several survey papers have attempted to provide a state-of-the-art overview, but they usually limit their scope to a particular type of positioning system or technology. In addition, they are written from the point of view of ML techniques and their practice, not from the point of view of the localization system and the specific problems that ML techniques can help to solve. This article is intended to offer a comprehensive state-of-the-art survey of the ML techniques that have been adopted over the last ten years to improve the performance of pedestrian localization systems, addressing the applicability of ML techniques in this domain, along with the main localization strategies. It concludes by indicating the underlying open issues and challenges associated with the existing systems, and possible future directions in which ML techniques could improve the performance of pedestrian localization systems. Among other open issues, most previous authors have focused their attention on position estimation accuracy, which wastes the potential of ML techniques to improve other performance parameters (e.g., response time, computational complexity, robustness, scalability or energy efficiency). This study shows that there is a strong trend towards the application of supervised learning. Consequently, there are many potential research opportunities in the use of other learning types, such as unsupervised and reinforcement learning, to improve the performance of pedestrian localization systems.


“…In recent years, researchers have shown that the most effective tools for the classification of sound events include the application of deep, convolutional, and recurrent neural networks (DNN, CNN, and RNN) [7], [8], [3], [9], [4]. However, for the current work, the concern with the processing time of the algorithms is fundamental, since, among the future goals, the aim is to create a low-cost system capable of running in real-time.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Recent work advocates the use of DNN and CNN can perceive patterns in auditors without using many features [9], [7]. Both were able to acquire good results using only Mel Frequency Cepstral Coefficients (MFCC).…”

Section: Literature Review (mentioning)

confidence: 99%

“…This causes directivity patterns to be created, amplifying signals from a certain direction, and attenuating from others [15], [16]. To find the lags of each microphone concerning the referential, auto-correlation analysis in the time or frequency domain can be performed [17], [7]. The so-called Generalized Sidelobe Canceller differs from the previous one for being adaptive.…”

Section: A Beamforming (mentioning)

confidence: 99%
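The lag estimation and alignment step mentioned in the beamforming statement above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names `estimate_lag` and `delay_and_sum` are hypothetical, the lag is found with a plain time-domain cross-correlation against a reference channel, only integer-sample lags are handled, and a circular shift stands in for a proper delay line.

```python
import numpy as np

def estimate_lag(reference, signal):
    """Return the integer lag (in samples) by which `signal` trails `reference`.

    Hypothetical helper: uses the peak of the full time-domain
    cross-correlation between the two channels.
    """
    corr = np.correlate(signal, reference, mode="full")
    # In "full" mode, index len(reference) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(reference) - 1)

def delay_and_sum(channels):
    """Align every channel to channels[0] and average the aligned signals.

    Assumption: a circular shift (np.roll) is acceptable for this toy
    example; a real system would pad or use fractional delays.
    """
    reference = channels[0]
    aligned = []
    for channel in channels:
        lag = estimate_lag(reference, channel)
        aligned.append(np.roll(channel, -lag))
    return np.mean(aligned, axis=0)

# Toy usage: the second microphone receives the same pulse 5 samples later.
mic0 = np.zeros(64)
mic0[20], mic0[21] = 1.0, 0.5
mic1 = np.roll(mic0, 5)
lag = estimate_lag(mic0, mic1)
beamformed = delay_and_sum([mic0, mic1])
```

Averaging after alignment reinforces the component arriving from the steered direction, while uncorrelated noise on the individual microphones is attenuated.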


Microphone Array Based Surveillance Audio Classification

Silva

Spadini

Suyama

2020

Anais De XXXVIII Simpósio Brasileiro De Telecomunicações E Processamento De Sinais


The work assessed seven classifiers and two beamforming algorithms for detecting surveillance sound events. The tests included the use of AWGN with -10 dB to 30 dB SNR and Data Augmentation (DA). The results showed that the combination of Support Vector Machine (SVM) and Delay-and-Sum (DaS) scored the best accuracy (up to 86.0%), but had high computational cost (≈ 79 ms), mainly due to DaS and DA. Stochastic Gradient Descent (SGD) also appears to be a good alternative, since it also achieved good accuracy (up to 85.3%) with a shorter processing time (≈ 25 ms).
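As an illustration of the noise condition described in this abstract, the sketch below adds white Gaussian noise to a signal at a chosen SNR in dB. It is a generic construction, not the authors' code: the function name `add_awgn` and its parameters are hypothetical, and the noise power is derived from the average power of the clean signal.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise at the requested SNR (dB).

    Hypothetical helper: noise power = signal power / 10^(snr_db / 10).
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Toy usage: a sine wave corrupted at 10 dB SNR, then the SNR is measured back.
t = np.arange(100_000)
clean = np.sin(2 * np.pi * 0.01 * t)
noisy = add_awgn(clean, snr_db=10.0, rng=np.random.default_rng(0))
residual = noisy - clean
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Sweeping `snr_db` over the range quoted in the abstract (-10 dB to 30 dB) would produce progressively cleaner test conditions.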
