2019
DOI: 10.1016/j.eswa.2019.05.017

Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation

Abstract: The task of Speaker LOCalization (SLOC) has been the focus of numerous works in the research field, where SLOC is performed on pure speech data, requiring the presence of an Oracle Voice Activity Detection (VAD) algorithm. Nevertheless, this perfect working condition is not satisfied in a real-world scenario, where employed VADs do commit errors. This work addresses this issue with an extensive analysis focusing on the relationship between several data-driven VAD and SLOC models, finally proposing a reliable fr…

Cited by 11 publications
(8 citation statements)
References 35 publications
“…In addition, we explore the use of spatial features to aid VAD+OSD and speaker counting. As mentioned above, a number of works have shown that spatial features can be used for counting (Drude et al, 2014;Pasha et al, 2017;Brutti et al, 2010;Pavlidi et al, 2012) and VAD (Vecchiotti et al, 2019b). However, to our knowledge, no study has yet been performed where spatial features are used in conjunction with deep neural networks to tackle OSD and speaker counting directly.…”
Section: Our Contribution (mentioning)
confidence: 99%
“…Both techniques are widely used thanks to their high accuracy and relatively low processing cost. For instance, the authors in [24], [26], [27], [44], [49], and [51] use CNN and RNN to estimate the pedestrian localization, improving the fingerprint creation process, using different data-training, and reducing the noise effects. The best result shows an improvement on accuracy by 75% when compared to the pedestrian localization system without the use of ML.…”
Section: A Machine Learning In Scene Analysis (mentioning)
confidence: 99%
“…In recent years, researchers have shown that the most effective tools for the classification of sound events include the application of deep, convolutional, and recurrent neural networks (DNN, CNN, and RNN) [7], [8], [3], [9], [4]. However, for the current work, the concern with the processing time of the algorithms is fundamental, since, among the future goals, the aim is to create a low-cost system capable of running in real-time.…”
Section: Literature Review (mentioning)
confidence: 99%
“…Recent work advocates the use of DNN and CNN can perceive patterns in auditors without using many features [9], [7]. Both were able to acquire good results using only Mel Frequency Cepstral Coefficients (MFCC).…”
Section: Literature Review (mentioning)
confidence: 99%