A Deep Neural Network Based End to End Model for Joint Height and Age Estimation from Short Duration Speech

Kalluri, Shareef Babu; Vijayasenan, Deepu; Ganapathy, Sriram

doi:10.1109/icassp.2019.8683397

Cited by 22 publications

(19 citation statements)

References 14 publications


“…The performance on the gender classification task on the Common Voice dataset with a baseline x-vector embedder is presented in Table 8. The age estimation RMSE of presented approach is 8.44 and 7.96 years for female and male speakers is comparable to the results reported by the authors at [22], with 8.63 and 7.60 years female/male. However, the network does not react well to the attempts of using transfer learning—a system pre-trained on Common Voice actually offers worse results than the system with no pre-training.…”

Section: Results (supporting)

confidence: 87%


Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks

Kwaśny; Hemmerling (2021). Sensors


The speech signal contains a vast spectrum of information about the speaker such as speakers’ gender, age, accent, or health state. In this paper, we explored different approaches to automatic speaker’s gender classification and age estimation system using speech signals. We applied various Deep Neural Network-based embedder architectures such as x-vector and d-vector to age estimation and gender classification tasks. Furthermore, we have applied a transfer learning-based training scheme with pre-training the embedder network for a speaker recognition task using the Vox-Celeb1 dataset and then fine-tuning it for the joint age estimation and gender classification task. The best performing system achieves new state-of-the-art results on the age estimation task using popular TIMIT dataset with a mean absolute error (MAE) of 5.12 years for male and 5.29 years for female speakers and a root-mean square error (RMSE) of 7.24 and 8.12 years for male and female speakers, respectively, and an overall gender recognition accuracy of 99.60%.



“…The work in [22] describes a DNN implementation for a joint height and age estimation system. Their results for age estimation are 0.6 years in terms of root mean square error (RMSE), 7.60 and 8.63 years for male and female using the TIMIT dataset [23].…”

Section: Introduction (mentioning)

confidence: 99%

Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks

Kwaśny; Hemmerling (2021). Sensors


“…The second issue is the design of a proper classification model [6, 50]. Recently, deep learning models have been applied for age and gender recognition [7]; however, the aforementioned issues remain unresolved.…”

Section: Discussion and Comparative Analysis (mentioning)

confidence: 99%

“…The constructed x-vector was then used for age estimation based on the speaker speech signal. A unified DNN architecture to recognize both the height and age of a speaker from short durations of speech was also proposed [7], which improved age estimation by 0.6 years in terms of the root mean square error (RMSE) over the classical SVR. The authors of [8] proposed a novel age estimation system based on Long short-term memory (LSTM) recurrent neural networks (RNN) that can deal with short utterances using acoustic features.…”

Section: Introduction (mentioning)

confidence: 99%

Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms

Tursunov; Mustaqeem; Choeh; et al. (2021). Sensors


Speech signals are being used as a primary input source in human–computer interaction (HCI) to develop several applications, such as automatic speech recognition (ASR), speech emotion recognition (SER), gender, and age recognition. Classifying speakers according to their age and gender is a challenging task in speech processing owing to the disability of the current methods of extracting salient high-level speech features and classification models. To address these problems, we introduce a novel end-to-end age and gender recognition convolutional neural network (CNN) with a specially designed multi-attention module (MAM) from speech signals. Our proposed model uses MAM to extract spatial and temporal salient features from the input data effectively. The MAM mechanism uses a rectangular shape filter as a kernel in convolution layers and comprises two separate time and frequency attention mechanisms. The time attention branch learns to detect temporal cues, whereas the frequency attention module extracts the most relevant features to the target by focusing on the spatial frequency features. The combination of the two extracted spatial and temporal features complements one another and provide high performance in terms of age and gender classification. The proposed age and gender classification system was tested using the Common Voice and locally developed Korean speech recognition datasets. Our suggested model achieved 96%, 73%, and 76% accuracy scores for gender, age, and age-gender classification, respectively, using the Common Voice dataset. The Korean speech recognition dataset results were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. The prediction performance of our proposed model, which was obtained in the experiments, demonstrated the superiority and robustness of the tasks regarding age, gender, and age-gender recognition from speech signals.


“…More recently, the Deep Learning (DL) paradigm has been applied to age estimation. For example, Deep Neural Networks (DNN) have been applied to predict both height and age of a speaker from short utterances [17]. In the case of age estimation, the Root Mean Squared Errors (RMSE) are 7.60 and 8.63 years for male and female respectively, when the mean duration of speech segments is around 2.5s.…”

Section: Introduction (mentioning)

confidence: 99%

Age group classification and gender recognition from speech with temporal convolutional neural networks

Sánchez-Hevia; Gil-Pita; Rosa-Zurera (2022). Multimed Tools Appl


This paper analyses the performance of different types of Deep Neural Networks to jointly estimate age and identify gender from speech, to be applied in Interactive Voice Response systems available in call centres. Deep Neural Networks are used, because they have recently demonstrated discriminative and representation capabilities in a wide range of applications, including speech processing problems based on feature extraction and selection. Networks with different sizes are analysed to obtain information on how performance depends on the network architecture and the number of free parameters. The speech corpus used for the experiments is Mozilla’s Common Voice dataset, an open and crowdsourced speech corpus. The results are really good for gender classification, independently of the type of neural network, but improve with the network size. Regarding the classification by age groups, the combination of convolutional neural networks and temporal neural networks seems to be the best option among the analysed, and again, the larger the size of the network, the better the results. The results are promising for use in IVR systems, with the best systems achieving a gender identification error of less than 2% and a classification error by age group of less than 20%.

