Voice gender recognition under unconstrained environments using self-attention

Nasef, Mohammed M.; Sauber, Amr M.; Nabil, Mohammed M.

doi:10.1016/j.apacoust.2020.107823

Cited by 20 publications

(11 citation statements)

References 14 publications


“…MFCC is the coefficient of the short-time windowed signal obtained by fast Fourier transformation (FFT), which has better results than the time domain operation. MFCC feature extraction mainly includes six steps [18]: pre-weighting [pre-emphasis], framing, windowing, FFT, Meyer [Mel] filter bank and discrete cosine transform (DCT), as shown in Fig. 1.…”

Section: Proposed Models (mentioning)

confidence: 99%
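
The six steps named in the statement above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed settings, not the cited authors' implementation: the function names, the 16 kHz sample rate, the 25 ms frame and 10 ms hop, the 0.97 pre-emphasis factor, the 26 filters and the 13 coefficients are all assumptions chosen because they are common defaults.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising edge
            bank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge
            bank[i - 1, k] = (right - k) / (right - centre)
    return bank

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13, pre_emph=0.97):
    # 1. Pre-emphasis: boost high frequencies with a first-order difference.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Framing: slice the signal into overlapping frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3. Windowing: taper each frame with a Hamming window.
    frames = frames * np.hamming(frame_len)
    # 4. FFT: power spectrum of each zero-padded frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. Mel filter bank: pool the spectrum into mel bands, then take the log.
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 6. DCT: decorrelate the log band energies into cepstral coefficients.
    return dct_ii(log_energies, n_coeffs)

# One second of synthetic noise stands in for a real recording.
rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))
print(features.shape)  # (98, 13)
```

Each row of the result is one frame and each column one cepstral coefficient, which is the shape a downstream classifier would typically consume.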

“…Jung et al. used short-time Fourier transform and MFCC to extract the features of lung sounds, revealed the relationship between lung sounds and pulmonary mechanism, and employed the depth separable CNN to effectively classify four types of lung sounds [17]. Nasef et al. reported a recognition technique to distinguish gender using MFCC features and Logistic Regression (LG) classifier, which can be carried out in the presence of background noise and different language, accent, age and emotional states [18]. It can be seen that MFCC is a powerful method to represent intrinsic characteristics of the sound signals.…”

Section: Introduction (mentioning)

confidence: 99%


Coal-gangue recognition via multi-branch convolutional neural network based on MFCC in noisy environment

Jiang, Zong, Song, et al. 2023. Sci Rep


Traditional coal-gangue recognition methods usually do not consider the impact of equipment noise, which severely limits their adaptability and recognition accuracy. This paper mainly studies more accurate recognition of coal-gangue in a noisy site environment, with the shearer, conveyor, transfer machine and other devices operating during top coal caving. A Mel Frequency Cepstrum Coefficients (MFCC) smoothing method was introduced to express the intrinsic feature of sound pressure more clearly at the coal-gangue recognition site. Then, a multi-branch convolution neural network (MBCNN) model with three branches was developed, and the smoothed MFCC feature was incorporated into this model to recognize falling coal and gangue in a noisy environment. The sound pressure signal datasets under the operation of different devices were constructed through extensive laboratory and site data acquisition. Comparative experiments were carried out on a noiseless dataset, a single-noise dataset and a simulated site dataset, and the results show that the method provides higher recognition accuracy and better robustness. The proposed coal-gangue recognition approach based on MBCNN and MFCC smoothing can recognize not only the state of falling coal or gangue, but also the operational state of site devices.


“…[35]), voice processing (e.g. [36], [37]) and hash-based cross-modal retrieval of images and texts (e.g. [38]).…”

Section: Related Work (mentioning)

confidence: 99%

Hybrid DAER Based Cross-modal Retrieval Exploiting Deep Representation Learning

Huang. 2023. Preprint


Information retrieval across multiple modalities has attracted much attention from academics and practitioners. One key challenge of cross-modal retrieval is to eliminate the heterogeneous gap between different patterns. Most of the existing methods tend to jointly construct a common subspace. However, very little attention has been given to the importance of different fine-grained regions of various modalities. This lack of consideration significantly influences the utilization of the extracted information of multiple modalities. Therefore, this study proposes a novel text-image cross-modal retrieval approach that constructs the dual attention network and the enhanced relation network (DAER). More specifically, the dual attention network tends to precisely extract fine-grained weight information from text and images, while the enhanced relation network is used to expand the differences between different categories of data in order to improve the computational accuracy of similarity. The comprehensive experimental results on three widely used datasets (i.e. Wikipedia, Pascal Sentence, and XMediaNet) show that the proposed approach is effective and superior to existing cross-modal retrieval methods.


“…For example, average frequency, mode and standard deviation. The second approach is to use the spectral properties of the sound, like MFCC's and Log-Mel features [18].…”

Section: Related Work (mentioning)

confidence: 99%

“…One of the main uses of spectrograms is sound analysis. The [18] display of signals in the time-frequency field provides many benefits in terms of sound classification. First, the time-frequency conversion is reversible.…”

Section: Figure 2 MFCC Feature Inference Steps (mentioning)

confidence: 99%
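
The reversibility claim in the statement above can be checked numerically. The sketch below is an assumption-laden illustration, not code from the cited paper: it uses a periodic Hann window with 50% overlap (a pairing whose shifted windows sum to one), and the names `stft` and `istft` are hypothetical helpers. Note that the inversion relies on the complex transform; a magnitude-only spectrogram discards phase and cannot be inverted exactly this way.

```python
import numpy as np

def stft(x, n=256):
    """Complex short-time Fourier transform with a periodic Hann window, hop n/2."""
    hop = n // 2
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n=256):
    """Inverse transform by overlap-add; exact wherever two frames overlap."""
    hop = n // 2
    frames = np.fft.irfft(spec, n=n, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n] += frame
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
rec = istft(stft(x))
# The first and last half-frame are covered by a single window, so compare the interior only.
print(np.allclose(rec[128:-128], x[128:-128]))  # True
```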

Speech-to-Gender Recognition Based on Machine Learning Algorithms

Hizlisoy, Çolakoğlu, Arslan. 2022. International Journal of Applied Mathematics Electronics and Computers


Speech recognition has several application areas such as human-machine interaction, classification of phone calls by gender, voice tagging, STT, etc. Predicting gender from audio signals is a problem that is easy for humans but difficult for a computer. In this study, a model based on MFCC and classification with machine learning is proposed for gender estimation from Turkish voice signals. Within the scope of the study, 58 different series and films were examined and a new original dataset was created with 894 audio recordings consisting of 5-second sections taken from them. Mel-frequency cepstral coefficients (MFCC) and spectrogram, which are frequently used in the literature, were used for feature extraction from audio data. The results were first evaluated separately using two features in one way. A hybrid feature vector was then created using two feature vectors. Different machine learning algorithms (LR, DT, RF, XGB etc.) were tested in the classification process and it was seen that the best accuracy was achieved in the hybrid model and logistic regression with 89%. Recall, precision and f-score values were obtained as 86.8%, 92% and 89.3%, respectively. The obtained test results revealed that the proposed model, together with the hybrid feature vector used, the original dataset and the classifier based on machine learning, showed classification success in terms of accuracy and was a stable and robust model.
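
As a quick consistency check on the figures in this abstract, the reported f-score follows from the reported precision and recall, assuming the score is the usual F1 (harmonic mean of the two). The function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 92% and recall 86.8%, as reported in the abstract.
print(round(f_score(0.92, 0.868) * 100, 1))  # 89.3
```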
