A Review on Voice-based Interface for Human-Robot Interaction

Badr, Ameer; Abdul-Hassan, Alia K.

doi:10.37917/ijeee.16.2.10

Cited by 19 publications

(10 citation statements)

References 0 publications


“…TTS synthesis can be defined as one of the systems by which normal language text is converted into speech. There are many differences between machine speech production and human, however, the increase in the capability of machine learning paradigms for simulating human speech production mechanisms will result in a more natural and accurate TTS [13], [24]. In this study, the pyttsx3 library [25] has been used for TTS synthesis as a robot's speech response.…”

Section: Speech Response Based On TTS Methods (mentioning)

confidence: 99%
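The citation statement above names the pyttsx3 library as the TTS engine for the robot's speech response. A minimal sketch of such a response, assuming pyttsx3 is installed and an audio backend is available, might look like the following. The function names, the response wording, and the default speaking rate are hypothetical and are not taken from the cited paper.

```python
def build_response(gender_label):
    """Compose the sentence the robot will say (hypothetical wording)."""
    return f"Hello, I think the speaker is a {gender_label}."


def speak(text, rate=150):
    """Speak `text` aloud with pyttsx3.

    pyttsx3 is imported lazily so that the text-building part of this
    sketch can be used on machines without a speech engine.
    """
    import pyttsx3  # third-party package, assumed to be installed

    engine = pyttsx3.init()            # pick the platform's default TTS driver
    engine.setProperty("rate", rate)   # speaking rate (assumed value)
    engine.say(text)                   # queue the utterance
    engine.runAndWait()                # block until speech has finished
```

Calling `speak(build_response("girl"))` would then voice the response. Keeping text composition separate from audio output makes the response logic testable without sound hardware.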


Gender detection in children’s speech utterances for human-robot interaction

Badr

Hassan

2022

IJECE


Human speech carries paralinguistic information that is used in many real-time applications. Detecting a child’s gender from speech is considered more challenging than detecting an adult’s. In this study, a system for human-robot interaction (HRI) is proposed to detect gender in children’s speech utterances without depending on the text. The robot's perception includes three phases. In the feature extraction phase, four formants are measured at each glottal pulse and a median is calculated across these measurements; three types of features are then measured: formant average (AF), formant dispersion (DF), and formant position (PF). In the feature standardization phase, the measured feature dimensions are standardized using the z-score method. In the semantic understanding phase, the children’s gender is detected using a logistic regression classifier. At the same time, the action of the robot is specified via a speech response using the text-to-speech (TTS) technique. Experiments are conducted on the Carnegie Mellon University (CMU) Kids dataset to measure the suggested system’s performance. The overall accuracy of the suggested system is 98%. The results show an accuracy improvement of up to 13% compared to related works that used the CMU Kids dataset.
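The feature extraction and standardization phases described in this abstract can be illustrated with a short sketch. It assumes the per-pulse formant values are already available as an array, and it uses common definitions that may differ from the paper's: formant average as the mean of the four median formants, and formant dispersion as the mean spacing between adjacent formants. Formant position is omitted because it needs population statistics that the abstract does not give. All function names are hypothetical.

```python
import numpy as np


def formant_features(formants_hz):
    """Summarize formants measured at each glottal pulse.

    formants_hz: array-like of shape (n_pulses, 4) holding F1..F4 in Hz.
    Returns (median per formant, formant average, formant dispersion).
    """
    f = np.asarray(formants_hz, dtype=float)
    med = np.median(f, axis=0)                     # median across pulses
    avg = med.mean()                               # assumed AF definition
    disp = (med[-1] - med[0]) / (len(med) - 1)     # assumed DF definition
    return med, avg, disp


def zscore(features):
    """Standardize each feature column to zero mean and unit variance."""
    x = np.asarray(features, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Illustrative values only, not data from the CMU Kids dataset.
pulses = [[500, 1500, 2500, 3500],
          [520, 1480, 2520, 3480],
          [510, 1510, 2490, 3510]]
med, avg, disp = formant_features(pulses)
standardized = zscore([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

The standardized features would then be passed to a classifier such as logistic regression, which the abstract names but this sketch does not implement.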



“…The formant frequencies' values decrease as the vocal tract length increases. Both male and female adults have higher formant frequencies compared to children [5], [13], [14]. Formants were only measured at the glottal pulse to make the measurement easier along with the whole utterance.…”

Section: Features Extraction Based On Formants (mentioning)

confidence: 99%

Gender detection in children’s speech utterances for human-robot interaction

Badr

Hassan

2022

IJECE


“…Therefore, knowing the pros and cons of each classifier can help in selecting the suitable classifier precisely. Machine learning classification approaches such as SVM, human visual system (HVS), Naïve Bayes (NB), and K-NN represent the most discriminatory and appropriate classifiers' techniques [56]- [58]. Table 8 illustrated the pros and cons of machine learning classifiers that help in detecting drivers' drowsiness.…”

Section: Learning Process (mentioning)

confidence: 99%

Modern drowsiness detection techniques: a review

Jasim

Hassan

2022

IJECE


According to recent statistics, drowsiness, rather than alcohol, is now responsible for one-quarter of all automobile accidents. As a result, many monitoring systems have been created to reduce and prevent such accidents. However, despite the large number of state-of-the-art drowsiness detection systems, it is not clear which one is the most appropriate. This paper discusses the following points. First, the existing supervised detection techniques are considered and grouped into four categories (behavioral, physiological, automobile, and hybrid). Second, the supervised machine learning classifiers used for drowsiness detection are described, followed by a discussion of the advantages and disadvantages of each evaluated technique. Lastly, a new strategy for detecting drowsiness is recommended.


“…Between all types of speech-based feature extraction domains, Cepstral domain features are the most successful ones, where a cepstrum is obtained by taking the inverse Fourier transform of the signal spectrum. MFCC is the most important method to extract speech-based features in this domain [8]. MFCCs greatness stems from the ability to exemplify the spectrum of speech amplitude in a concise form.…”

Section: Mel-frequency Cepstral Coefficients (MFCCs) (mentioning)

confidence: 99%

“…These steps are shown in Figure 1. At the end of these steps, one energy and 12 cepstral features are obtained [8,10].…”

Section: Mel-frequency Cepstral Coefficients (MFCCs) (mentioning)

confidence: 99%
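The two statements above describe cepstral features in general terms. As a rough illustration, the sketch below computes a plain real cepstrum of one frame with NumPy, taking the inverse Fourier transform of the log-magnitude spectrum (the usual definition), and keeps the first 13 values to mirror the "one energy and 12 cepstral features" count. This is not a full MFCC pipeline: the mel filterbank and DCT steps are omitted, and the frame length, window, and random test signal are assumptions.

```python
import numpy as np


def real_cepstrum(frame):
    """Real cepstrum of one frame: IFFT of the log-magnitude spectrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))       # reduce spectral leakage
    spectrum = np.fft.rfft(windowed)                # one-sided spectrum
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # small offset avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))      # back to the quefrency domain


# Synthetic 400-sample frame (25 ms at an assumed 16 kHz sampling rate).
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
cepstrum = real_cepstrum(frame)
features = cepstrum[:13]   # low-quefrency coefficients as a compact summary
```

The low-quefrency coefficients capture the smooth spectral envelope, which is why a small number of them can represent the speech spectrum concisely.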

Age Estimation in Short Speech Utterances Based on Bidirectional Gated-Recurrent Neural Networks

Badr

Abdul-Hassan

2021

ETJ


Recently, age estimation from speech has received growing interest, as it is important for many applications such as custom call routing, targeted marketing, and user profiling. In this work, an automatic system to estimate age in short speech utterances without depending on the text is proposed. From each utterance frame, four groups of features are extracted, and then 10 statistical functionals are measured for each extracted feature dimension, followed by dimensionality reduction using Linear Discriminant Analysis (LDA). Finally, bidirectional Gated-Recurrent Neural Networks (G-RNNs) are used to predict speaker age. Experiments are conducted on the VoxCeleb1 dataset to show the performance of the proposed system, which is the first attempt to do so for such a system. In the gender-dependent system, the Mean Absolute Error (MAE) of the proposed system is 9.25 years for female speakers and 10.33 years for male speakers, and the Root Mean Square Error (RMSE) is 13.17 and 13.26, respectively. In the gender-independent system, the MAE is 10.96 years and the RMSE is 15.47. The results show that the proposed system performs well on short-duration utterances, taking into consideration the high noise ratio in the VoxCeleb1 dataset.
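The abstract reports its results as MAE and RMSE in years. For readers unfamiliar with the two metrics, the sketch below shows how they are typically computed with NumPy. The ages used are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
import numpy as np


def mean_absolute_error(true_ages, predicted_ages):
    """Average magnitude of the prediction errors, in years."""
    t = np.asarray(true_ages, dtype=float)
    p = np.asarray(predicted_ages, dtype=float)
    return np.mean(np.abs(p - t))


def root_mean_square_error(true_ages, predicted_ages):
    """Square root of the mean squared error; penalizes large misses more."""
    t = np.asarray(true_ages, dtype=float)
    p = np.asarray(predicted_ages, dtype=float)
    return np.sqrt(np.mean((p - t) ** 2))


# Illustrative values only: errors are +2, -3, and +5 years.
true_ages = [20, 30, 40]
predicted_ages = [22, 27, 45]
mae = mean_absolute_error(true_ages, predicted_ages)
rmse = root_mean_square_error(true_ages, predicted_ages)
```

RMSE is never smaller than MAE on the same predictions, so a wide gap between the two, as in the figures reported above, suggests that a few utterances have large errors.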

