Using Attention Networks and Adversarial Augmentation for Styrian Dialect Continuous Sleepiness and Baby Sound Recognition

Yeh, Sung-Lin; Chao, Gao-Yi; Su, Bo-Hao; Huang, Yuping; Lin, Ming-Huei; Tsai, Yin-Chun; Tai, Yu-Wen; Lu, Zheng-Chi; Chen, Chieh-Yu; Tai, Tsung‐Ming; Tseng, Chiu-Wang; Lee, Cheng‐Kuang; Lee, Chi-Chun

doi:10.21437/interspeech.2019-2110

Cited by 13 publications

(10 citation statements)

References 21 publications

“…(Note that this score was improved to .387 by training ensembles of classifiers [20].) In [21], the authors employ attention networks and adversarial augmentation, in the end, their best results (.369 of CC on test) are achieved by a fusion of neural network models. In [22], a .367 of CC was obtained by an early fusion of the learnt representations from attention and sequence to sequence autoencoders.…”

Section: Results (mentioning)

confidence: 99%

Deep Neural Network Embeddings for the Estimation of the Degree of Sleepiness

Egas-López; Gosztolya. 2021. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Estimating the degree of sleepiness from human speech is an emerging research problem with straightforward applications. In this study, we employ the x-vector approach, currently the state-of-the-art in speaker recognition, as a neural network feature extractor to detect the level of sleepiness of a speaker. Besides using different corpora for fitting the x-vector DNN, we also experiment with adding noise and reverberation to the training samples. According to our experimental results for the publicly available Düsseldorf Sleepy Language Corpus, utilizing x-vector embeddings as features for Support Vector Regression consistently leads to competitive performance scores in sleepiness detection. In particular, we present the highest Spearman's correlation coefficient on the public corpus that was achieved by a single method.

“…However, notice that the automated method does lead to much better performance than Zooniverse classifications for Crying, and to some improvements in Canonical and Junk. For instance, the team who won the challenge in 2019 improved UAR by about 2%, primarily through gains in the laughing class obtained by adding training data (Yeh et al, 2019). That state of the art was challenged by Kaya, Verkholyak, Markitantov, and Karpov (2020), who obtained a UAR of 61% on the same data as the challenge, thanks to improvements in all of the classes but for Laughing.…”

Section: Discussion (mentioning)

confidence: 99%

Describing vocalizations in young children: A big data approach through citizen science annotation

Semenzin; Hamrick; Seidl; et al. 2020. Preprint

Recording young children's vocalizations through wearables is a promising method. However, accurately and rapidly annotating these files remains challenging. Online crowdsourcing with the collaboration of citizen scientists could be a feasible solution. In this paper, we assess the extent to which citizen scientists' annotations align with those gathered in the lab for recordings collected from young children. Segments identified by LENA™ as produced by the key child were extracted from one daylong recording for each of 20 participants: 10 low-risk control children and 10 children diagnosed with Angelman syndrome, a neurogenetic syndrome characterized by severe language impairments. Speech samples were annotated by trained annotators in the laboratory as well as by citizen scientists on Zooniverse. All annotators assigned one of five labels to each sample: Canonical, Non-Canonical, Crying, Laughing, and Junk. This allowed the derivation of two child-level vocalization metrics: the Linguistic Proportion, and the Canonical Proportion. At the segment level, Zooniverse classifications had moderate precision and recall. More importantly, the Linguistic Proportion and the Canonical Proportion derived from Zooniverse annotations were highly correlated with those derived from laboratory annotations. Annotations obtained through a citizen science platform can help us overcome challenges posed by the process of annotating daylong speech recordings. Particularly when used in composites or derived metrics, such annotations can be used to investigate early markers of language delays in non-typically developing children.

“…However, notice that the automated method does lead to much better performance than Zooniverse classifications for Crying, and to some improvements in Canonical and Junk. For instance, the team who won the challenge in 2019 improved UAR by about 2%, primarily through gains in the laughing class obtained by adding training data (Yeh et al, 2019). That state of the art was challenged by Kaya et al (2020), who obtained a UAR of 61% on the same data as the challenge, thanks to improvements in all of the classes but for Laughing.…”

Section: Further Research Directions (mentioning)

confidence: 99%

Describing Vocalizations in Young Children: A Big Data Approach Through Citizen Science Annotation

Semenzin; Hamrick; Seidl; et al. 2021. J Speech Lang Hear Res

Purpose: Recording young children's vocalizations through wearables is a promising method to assess language development. However, accurately and rapidly annotating these files remains challenging. Online crowdsourcing with the collaboration of citizen scientists could be a feasible solution. In this article, we assess the extent to which citizen scientists' annotations align with those gathered in the lab for recordings collected from young children. Method: Segments identified by Language ENvironment Analysis as produced by the key child were extracted from one daylong recording for each of 20 participants: 10 low-risk control children and 10 children diagnosed with Angelman syndrome, a neurogenetic syndrome characterized by severe language impairments. Speech samples were annotated by trained annotators in the laboratory as well as by citizen scientists on Zooniverse. All annotators assigned one of five labels to each sample: Canonical, Noncanonical, Crying, Laughing, and Junk. This allowed the derivation of two child-level vocalization metrics: the Linguistic Proportion and the Canonical Proportion. Results: At the segment level, Zooniverse classifications had moderate precision and recall. More importantly, the Linguistic Proportion and the Canonical Proportion derived from Zooniverse annotations were highly correlated with those derived from laboratory annotations. Conclusions: Annotations obtained through a citizen science platform can help us overcome challenges posed by the process of annotating daylong speech recordings. Particularly when used in composites or derived metrics, such annotations can be used to investigate early markers of language delays.
