Automatic speaker profiling from short duration speech data

Kalluri, Shareef Babu; Vijayasenan, Deepu; Ganapathy, Sriram

doi:10.1016/j.specom.2020.03.008

Cited by 25 publications

(27 citation statements)

References 26 publications


“…Note, however, that the lengths of utterances in that dataset are much higher then of those in the TIMIT dataset and the authors report much worse results on shorter test segments. These results are also on-par with those recently published in [24] (5.6 and 5.2 MAE female/male) without using any hand-engineered features and relying solely on the low-level signal representation.…”

Section: Results (supporting)

confidence: 86%

“…On top of that, the gender classification accuracy is competitive with the results achieved by the d-vector system, shown in Table 10. This results are also the best in terms of MAE out of all proposed solution and better then the current state-of-the-art results shown in [24] by 0.31 and 0.08 MAE for female and male speakers, respectively.…”

Section: Results (supporting)

confidence: 57%

“…Their results for age estimation are 0.6 years in terms of root mean square error (RMSE), 7.60 and 8.63 years for male and female using the TIMIT dataset [23]. Finally, in the latest paper from 2020 [24], the authors propose a feature-engineering based support vector regression system and achieve a state-of-the-art results on the TIMIT dataset, with a mean absolute error (MAE) of 5.2 for males and 5.6 years for female.…”

Section: Introduction (mentioning)

confidence: 99%


Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks

Kwaśny

Hemmerling

2021

Sensors


The speech signal contains a vast spectrum of information about the speaker, such as the speaker’s gender, age, accent, or health state. In this paper, we explored different approaches to automatic speaker gender classification and age estimation from speech signals. We applied various Deep Neural Network-based embedder architectures such as x-vector and d-vector to age estimation and gender classification tasks. Furthermore, we have applied a transfer learning-based training scheme, pre-training the embedder network for a speaker recognition task using the Vox-Celeb1 dataset and then fine-tuning it for the joint age estimation and gender classification task. The best performing system achieves new state-of-the-art results on the age estimation task on the popular TIMIT dataset, with a mean absolute error (MAE) of 5.12 years for male and 5.29 years for female speakers and a root-mean-square error (RMSE) of 7.24 and 8.12 years for male and female speakers, respectively, and an overall gender recognition accuracy of 99.60%.


“…Speaker attribute estimation: In speech fields, various methods that estimate speaker attributes such as gender, age, and height have been studied [19][20][21][22][23][24]. In the last decade, fully neural network based methods have been examined to precisely capture input speech contexts [21][22][23][24]. In fact, multiple-speaker attributes are often jointly estimated via multi-task learning [22,24].…”

Section: Related Work (mentioning)

confidence: 99%

Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation

Masumura

Okamura

Makishima

et al. 2021

Preprint


In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training, in which transcriptions of multiple speakers are recursively generated one after another. This enables us to naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances such as gender and age information. In fact, the performance deteriorates when each speaker is the same gender or is close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation. Our key idea is to handle gender and age estimation tasks within the unified autoregressive modeling. In the proposed method, a transformer-based autoregressive model recursively generates not only textual tokens but also attribute tokens of each speaker. This enables us to effectively utilize speaker attributes for improving multi-talker overlapped ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the effectiveness of the proposed method.


“…Vogel and Morgan documented that the length of obtained speech data impacted the measurement accuracy of bio-acoustic features [26]. Although several efforts have been made to explore the accuracy of short-duration speech samples for detecting a disease or estimating a physical parameter [27]-[30], only a few studies have explored the impact of voice sample length on speech characteristics [31]-[33]. Scherer et al have shown that, in sustained vowel tasks, the stability of perturbation measurements, jitter and shimmer, is affected by the task duration.…”

Section: Introduction (mentioning)

confidence: 99%

The Reproducibility of Bio-Acoustic Features is Associated With Sample Duration, Speech Task, and Gender

Almaghrabi

Thewlis

Thwaites

et al. 2022

IEEE Trans. Neural Syst. Rehabil. Eng.


Bio-acoustic properties of speech show evolving value in analyzing psychiatric illnesses. Obtaining a sufficient speech sample length to quantify these properties is essential, but the impact of sample duration on the stability of bio-acoustic features has not been systematically explored. We aimed to evaluate bio-acoustic features' reproducibility against changes in speech durations and tasks. We extracted source, spectral, formant, and prosodic features in 185 English-speaking adults (98 w, 87 m) for reading-a-story and counting tasks. We compared features at 25% of the total sample duration of the reading task to those obtained from non-overlapping randomly selected sub-samples shortened to 75%, 50%, and 25% of total duration using intraclass correlation coefficients. We also compared the features extracted from entire recordings to those measured at 25% of the duration and features obtained from 50% of the duration. Further, we compared features extracted from reading-a-story to counting tasks. Our results show that the number of reproducible features (out of 125) decreased stepwise with duration reduction. Spectral shape, pitch, and formants reached excellent repro-

