In this contribution, we present analyses of vocalisation data recorded in the first observation round of the European Commission's Erasmus Plus project "EMBOA, Affective loop in Socially Assistive Robotics as an intervention tool for children with autism". In total, the project partners recorded data in 112 robot-supported intervention sessions for children with autism spectrum disorder. Audio data were recorded using the internal and lapel microphones of the H4n Pro Recorder. To analyse the data, we first apply a child voice activity detection (VAD) system to extract child vocalisations from the raw audio. For each child, session, and microphone, we report the total duration of detected child vocalisations. Next, we compare the results of two different implementations of valence- and arousal-based speech emotion recognition, applied to (1) the child vocalisations detected by the VAD and (2) the complete recorded audio material, and we report average valence and arousal values for each session and condition. Finally, we discuss the challenges and limitations of child voice detection and audio-based emotion recognition in robot-supported intervention settings.
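The two-stage pipeline described above can be sketched in simplified form. This is a hypothetical illustration only: the energy-threshold VAD and the `mock_emotion_model` regressor are stand-ins we introduce here (the project used a dedicated child VAD system and trained emotion recognition models, neither of which is shown), and the frame length and threshold values are assumptions.

```python
import numpy as np

FRAME_LEN = 0.025  # 25 ms analysis frames (assumed, not from the paper)

def detect_voiced_frames(audio, sr, threshold=0.01):
    """Stand-in VAD: mark frames whose RMS energy exceeds a threshold."""
    hop = int(sr * FRAME_LEN)
    n = len(audio) // hop
    frames = audio[:n * hop].reshape(n, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold  # boolean mask, one entry per frame

def total_voiced_time(mask):
    """Total detected vocalisation time in seconds (reported per child,
    session, and microphone in the study)."""
    return float(mask.sum() * FRAME_LEN)

def mock_emotion_model(segment):
    """Placeholder for a valence/arousal regressor; real systems predict
    these dimensions from learned acoustic features."""
    e = float(np.sqrt((segment ** 2).mean()))
    return np.tanh(e), np.tanh(2 * e)  # (valence, arousal)

def session_averages(audio, sr, mask):
    """Average valence/arousal over the voiced frames of one session."""
    hop = int(sr * FRAME_LEN)
    vals = [mock_emotion_model(audio[i * hop:(i + 1) * hop])
            for i in np.flatnonzero(mask)]
    if not vals:
        return 0.0, 0.0
    v, a = zip(*vals)
    return float(np.mean(v)), float(np.mean(a))

# Synthetic example: 1 s of silence followed by 1 s of a 220 Hz tone
# standing in for a child vocalisation.
sr = 16000
audio = np.concatenate([np.zeros(sr),
                        0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)])
mask = detect_voiced_frames(audio, sr)
print(round(total_voiced_time(mask), 2))  # → 1.0 (seconds detected)
```

Running the emotion model over the whole recording instead of only the VAD-detected frames (condition (2) in the text) simply means replacing `mask` with an all-true mask of the same length.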