A person can be recognized by voice alone. In principle, each person's voice has a distinct tone (pitch). This study measures the performance of a Deep Neural Network (DNN) on speaker recognition using static and dynamic prosodic features. Prosody is information about speech related to the tone, intonation, stress, duration, and rhythm of a person's pronunciation. The data consist of dictated and spontaneous speech taken from YouTube, covering three male voices and one female voice. The recordings are segmented into durations of 3 seconds, 5 seconds, and 10 seconds. From each segment, static prosodic features with 103 dimensions and dynamic prosodic features with 13 dimensions are extracted. Each feature set, and the combination of the two, is trained and tested using a DNN with a 90:10 train/test ratio. The results show that the 10-second segments yield higher accuracy than the shorter segments, and that static prosodic features are more accurate than dynamic prosodic features. The average accuracy of the DNN is 87.02% for static prosodic features, 72.97% for dynamic prosodic features, and 87.72% for the combined static and dynamic prosodic features.
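The segmentation and 90:10 split described above can be illustrated with a minimal sketch. The abstract does not state the sample rate, whether segments overlap, or how the split was drawn, so the following are assumptions for illustration only: a 16 kHz sample rate, consecutive non-overlapping segments with the trailing remainder dropped, and a seeded random shuffle before splitting. The function names `segment_signal` and `split_train_test` are hypothetical and not taken from the study.

```python
import random


def segment_signal(samples, sample_rate, seconds):
    """Cut a signal into consecutive, non-overlapping fixed-length segments.

    Assumption: any trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(sample_rate * seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def split_train_test(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of the items and split it into train and test parts.

    Assumption: a seeded random shuffle; the study may have split differently.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    sample_rate = 16000                    # assumed, not given in the abstract
    signal = [0.0] * (sample_rate * 60)    # placeholder for 60 s of audio
    for seconds in (3, 5, 10):             # the three durations from the abstract
        segments = segment_signal(signal, sample_rate, seconds)
        train, test = split_train_test(segments, train_fraction=0.9)
        print(seconds, len(segments), len(train), len(test))
```

In a real pipeline the prosodic features would be extracted from each segment before the split, and the split would usually be stratified by speaker so that all four voices appear in both parts.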