Paolo Vecchiotti scite author profile

Computer Speech & Language

et al. 2018

Deep Neural Networks for Joint Voice Activity Detection and Speaker Localization

Squartini

et al. 2018

Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation

Vesperini

et al. 2016

Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation

Pepe

Expert Systems with Applications

et al. 2019

Speaker LOCalization (SLOC) has been the focus of numerous works, in which it is typically performed on pure speech data and therefore requires an Oracle Voice Activity Detection (VAD) algorithm. This ideal working condition is not satisfied in a real-world scenario, where the employed VADs do commit errors. This work addresses the issue with an extensive analysis of the relationship between several data-driven VAD and SLOC models, and finally proposes a reliable framework for VAD and SLOC. The effectiveness of the approach is assessed in a multi-room scenario, which is close to a real-world environment. Furthermore, to the best of the authors' knowledge, only one contribution proposes a single framework for VAD and SLOC in this scenario; however, that solution does not rely on data-driven approaches. This work extends the authors' previous research on the VAD and SLOC tasks by proposing numerous advancements to the original neural network architectures. In detail, four different models based on convolutional neural networks (CNNs) are tested, in order to highlight the advantages of the introduced novelties. In addition, two different CNN models are studied for SLOC. Furthermore, the training of the data-driven models is improved through a specific data augmentation technique. During this procedure, the room impulse responses (RIRs) of two virtual rooms are generated from knowledge of the room size, the reverberation time, and the placement of microphones and sources. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is taken into account for a fair comparison with the proposed method. As a result, the proposed method proves to be more accurate than the baseline framework, and remarkable improvements are especially observed when the data

Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment

Vesperini

et al. 2017

This paper focuses on employing Convolutional Neural Networks (CNNs) with 3-D kernels for Voice Activity Detectors in multi-room domestic scenarios (mVAD). This technology is compared with the Multi Layer Perceptron (MLP), and notable advancements are observed with respect to the authors' previous works. In order to approximate real-life scenarios, the DIRHA dataset is used; it was recorded in a home environment by means of several microphones arranged in various rooms. The study consists of a multi-stage analysis focusing on the selection of the network size and of the input microphones in relation to their number and position. Results are evaluated in terms of Speech Activity Detection error rate (SAD). The CNN-mVAD outperforms the other method with notably robust performance statistics, achieving in the best overall case a SAD equal to 7.0%.

End-to-end Binaural Sound Localisation from the Raw Waveform

Vecchiotti, Squartini, Brown

2019

Preprint