Supervised Contrastive Learning for Voice Activity Detection

Heo, Youngjun; Lee, Sunggu

doi:10.3390/electronics12030705

Cited by 2 publications

(2 citation statements)

References 23 publications

“…In [6], a VAD learning strategy using Supervised Contrastive Learning (Supervised Contrastive Learning for Voice Activity Detection, SCLVAD) was proposed for the first time. The proposed method was used in combination with audio-specific data augmentation methods, which were trained using two common sets of English language speech data: the Google Speech Commands Dataset V2 and audio samples from the site freesound.org, and then evaluated using the third AVA-Speech English dataset.…”

Section: Literature Review and Problem Statement (mentioning)

confidence: 99%

The dependence of the effectiveness of neural networks for recognizing human voice on language

Nurlankyzy, Akhmediyarova, Zhetpisbayeva et al. 2024

EEJET

This study examines the effectiveness of three neural network architectures, the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN), for human voice recognition, with an emphasis on the Kazakh language. It considers problems related to language, differences between speakers, and the influence of network architecture on recognition accuracy. The methodology includes extensive training and testing, measuring recognition accuracy across different languages and different sets of speaker data. Using a comparative analysis, the study evaluates the performance of the three architectures trained exclusively on Kazakh. Testing included utterances in Kazakh and other languages, and the number of speakers was varied to assess its impact on recognition accuracy. The results showed that CNNs are more effective at recognizing the human voice than RNNs and MLPs. The CNN also achieved higher accuracy in recognizing the human voice in Kazakh, for both small and large numbers of speakers. For example, for 20 speakers, the recognition error was 21.86 % for Russian versus 10.6 % for Kazakh. A similar trend was observed for 80 speakers: 16.2 % for Russian and 8.3 % for Kazakh. Training on one language therefore does not guarantee high recognition accuracy in other languages, and the accuracy of human voice recognition by neural networks depends significantly on the language in which training is conducted. In addition, the study highlights the importance of varied speaker data sets for achieving optimal results. This knowledge is crucial for developing reliable human voice recognition systems that can accurately identify different human voices in different language contexts.

“…These two advantages of SCL settings provide more efficient feature learning over the SSCL approach. Several studies in the audio domain have effectively applied SCL, for instance, in environmental sound classification [31], voice activity detection [32], accented speech recognition [33], and musical onset detection [34], exhibiting superior performance when compared to models trained using cross-entropy.…”

Section: Related Work (mentioning)

confidence: 99%

Identification of Non-Speaking and Minimal-Speaking Individuals Using Nonverbal Vocalizations

Tran, Tsai 2024

IEEE Access

Speech remains a prevalent mode of communication powering various intelligent functions in human-computer interaction applications, notably in Speaker/Person Identification (PID) systems. However, there is a considerable population of Non-speaking and Minimal-speaking (NMS) individuals, who heavily rely on nonverbal vocalizations for communication, and the existing speech-based PID systems may not be suitable for users from this community. This study delves into the use of nonverbal vocalizations to identify NMS subjects, termed as NMS-PID, and explores the feasibility of developing an identification system, namely S-NMS-PID, that accommodates both speaking users (with speech input) and NMS users (with nonverbal-vocalization input). Leveraging the recently published ReCANVo dataset of NMS nonverbal vocalizations and our speech dataset, our experiments with multiple networks and acoustic features demonstrate promising results for NMS-PID and S-NMS-PID, evident in average accuracies ranging from 70% to 92%. The proposed convolutional recurrent neural network-based model, despite its smaller size, achieves results nearly on par with much deeper models such as VGG16 and ResNet50. Our findings also highlight the efficacy of Mel-frequency cepstral coefficients features compared to the spectrogram features. Furthermore, a two-step training strategy involving supervised contrastive learning for representation learning followed by fine-tuning with cross-entropy loss significantly enhances robustness and accuracy, particularly in classifying data from minority classes, enhancing overall performance. This study's outcomes hold potential for tailoring human-computer interaction applications specifically for NMS users. Implementing NMS-PID and S-NMS-PID in security and authentication processes ensures secure and reliable user identification across diverse platforms, transcending sole reliance on speech-based methods.
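The first step of the two-step strategy described in this abstract, representation learning with a supervised contrastive loss, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `supervised_contrastive_loss`, the temperature of 0.1, and the toy embeddings are assumptions chosen for illustration, and the formula follows the commonly used form of the loss in which each anchor is pulled toward all other samples sharing its label.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors with at least one positive.

    Hypothetical helper for illustration; the temperature value is an assumption.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    n = len(y)
    not_self = ~np.eye(n, dtype=bool)
    # Positives share the anchor's label; the anchor itself is excluded.
    positives = (y[:, None] == y[None, :]) & not_self

    # Log-softmax over all other samples, with the anchor left out of the denominator.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(-mean_log_prob_pos.mean())


# Toy batch: two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
print(loss_matching, loss_mismatched)
```

In this toy batch the loss is lower when the labels agree with the clusters than when they cut across them, which is the property the representation-learning step relies on. In the second step, an encoder trained this way would be fine-tuned with an ordinary cross-entropy classifier, which this sketch does not cover.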
