Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features

Salishev, Sergey I.; Barabanov, Andrey E.; Kocharov, Daniil; Skrelin, Pavel A.; Moiseev, Mikhail

doi:10.1007/978-3-319-45510-5_40

Cited by 9 publications

(5 citation statements)

References 6 publications


“…As the target of the VAD system is to remove as much information-shallow data from the audio data as possible, we compare several approaches here: at a first level, we try to filter for all vocalisations with general VAD systems, one specifically trained on our data set, the other one being an implementation of the WebRTC VAD system (Google, 2021), commonly used as a comparison for other VAD systems, e.g., (Salishev et al, 2016; Nahar and Kai, 2020). The aggressiveness score of the WebRTC VAD is set equal to one.…”

Section: Voice Activity Detection (mentioning)

confidence: 99%

Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children

Milling, Baird, Bartl-Pokorny, et al. 2022. Front. Comput. Sci.


Individuals with autism are known to face challenges with emotion regulation, and express their affective states in a variety of ways. With this in mind, an increasing amount of research on automatic affect recognition from speech and other modalities has recently been presented to assist and provide support, as well as to improve understanding of autistic individuals' behaviours. As well as the emotion expressed from the voice, for autistic children the dynamics of verbal speech can be inconsistent and vary greatly amongst individuals. The current contribution outlines a voice activity detection (VAD) system specifically adapted to autistic children's vocalisations. The presented VAD system is a recurrent neural network (RNN) with long short-term memory (LSTM) cells. It is trained on 130 acoustic Low-Level Descriptors (LLDs) extracted from more than 17 h of audio recordings, which were richly annotated by experts in terms of perceived emotion as well as occurrence and type of vocalisations. The data consist of 25 English-speaking autistic children undertaking a structured, partly robot-assisted emotion-training activity and was collected as part of the DE-ENIGMA project. The VAD system is further utilised as a preprocessing step for a continuous speech emotion recognition (SER) task aiming to minimise the effects of potential confounding information, such as noise, silence, or non-child vocalisation. Its impact on the SER performance is compared to the impact of other VAD systems, including a general VAD system trained from the same data set, an out-of-the-box Web Real-Time Communication (WebRTC) VAD system, as well as the expert annotations. Our experiments show that the child VAD system achieves a lower performance than our general VAD system, trained under identical conditions, as we obtain receiver operating characteristic area under the curve (ROC-AUC) metrics of 0.662 and 0.850, respectively. 
The SER results show varying performances across valence and arousal depending on the utilised VAD system with a maximum concordance correlation coefficient (CCC) of 0.263 and a minimum root mean square error (RMSE) of 0.107. Although the performance of the SER models is generally low, the child VAD system can lead to slightly improved results compared to other VAD systems and in particular the VAD-less baseline, supporting the hypothesised importance of child VAD systems in the discussed context.
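The abstract above reports SER performance as a concordance correlation coefficient (CCC) and a root mean square error (RMSE). As a minimal sketch of how these two metrics are usually defined, assuming plain Python lists of gold and predicted values (the function names `ccc` and `rmse` are illustrative and not taken from the cited paper):

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def ccc(y_true, y_pred):
    """Concordance correlation coefficient (population statistics).

    CCC = 2 * cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))^2)
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Both sequences are constant and identical: CCC is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom
```

A CCC of 1 means perfect agreement, 0 means no concordance, and -1 means perfect inverse agreement. Unlike Pearson correlation, CCC also penalises a constant offset between predictions and gold values.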


“…A VAD is commonly applied for speech and speaker recognition tasks, as well as for telephony, while VAD is more recently referenced in speaker diarization research [65]. Typically, VAD involves using statistical models and short-term energy-based features [66]. Some works in the reviewed literature briefly describe utilization of their own VAD involving on a threshold-based technique dependent on the dataset used [13,27,31,48].…”

Section: Voice Activity Detection (mentioning)

confidence: 99%
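The statement above refers to VADs built on short-term energy features with a dataset-dependent threshold. The following is a minimal sketch of that general idea only; it is not the method of any paper listed here. The frame length (160 samples, i.e. 10 ms at an assumed 16 kHz sample rate), the -30 dB threshold, and the function names are all illustrative assumptions.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in dB (floor near -120 dB)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies.append(10 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Flag a frame as active when its energy exceeds the threshold."""
    return [e > threshold_db for e in frame_energies_db(samples, frame_len)]


# Usage: two frames of silence followed by two frames of a 440 Hz tone.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
flags = energy_vad(silence + tone)
```

In practice a fixed threshold like this has to be tuned per dataset, which is the limitation the quoted statement points to.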

“…The aggressiveness of the module can be set from a scale of 0-3, with 0 being the least aggressive at filtering out non-speech frames. The WebRTC VAD has been referenced in the work by Stoter et al for speaker counting [10] and in literature comparing VAD techniques [66]. This VAD module was experimented with as a Python library although the results were not substantial as the VAD was unable to detect the presence of non-speech from the data despite being regarded as a state-of-the-art module.…”

Section: Voice Activity Detection (mentioning)

confidence: 99%

Multimodal System for Audio Scene Source Counting and Analysis

Nigro, Krishnan. 2022. IEEE/ACM Trans. Audio Speech Lang. Process.


This thesis explores audio scene analysis (ASA) for determining the number of active sources in an audio scene, a task that is defined as audio source counting. A first of its kind dataset called SARdB is produced with audio and text modalities, and annotations for the number of speakers and the number of sound events present in an audio recording. For speaker counting, an audio-based ResNet-34 and text-based Bidirectional Long Short-Term Memory (BLSTM) network set a baseline prediction accuracy of 46.03% and 89.57% when considering a margin of error of one speaker, while outperforming various state-of-the-art systems in speaker counting. Another audio-based ResNet-34 model demonstrates the optimal result for sound event counting at 50.55% prediction accuracy and 86.59% accuracy with a margin of error of one sound event. The proposed method for source counting is also shown to perform in real-time with an overall processing time of ∼0.4614s.


“…We used WebRTC [22] to perform A-VAD. WebRTC employs multiple-frequency (subband) features combined with a pre-trained GMM classifier.…”

Section: B Audio Voice Activity Detection (mentioning)

confidence: 99%

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Guy, Lathuilière, Mesejo, et al. 2021. 2020 25th International Conference on Pattern Recognition (ICPR)


Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild, WildVVAD, based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
