In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via a bootstrap training scheme with a uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between speech samples belonging to the same speaker, which provides not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy effectively helps learn the speaker representations and outperforms conventional methods based on contrastive learning. We also demonstrate that the integrated two-stage framework further improves speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.
INDEX TERMS Speaker verification, self-supervised learning, bootstrap representation learning, probabilistic speaker embedding.
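The two ingredients above can be sketched in isolation. The following is a minimal, illustrative Python sketch, assuming the uniformity regularizer takes the common hypersphere form (log of the mean Gaussian potential over embedding pairs, as in Wang & Isola, 2020) and the mutual likelihood score takes the diagonal-Gaussian form used in probabilistic face embeddings; the abstract does not specify either exact formulation, so both are labeled assumptions.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def uniformity_loss(embeddings, t=2.0):
    """Assumed uniformity regularizer: log of the mean Gaussian
    potential exp(-t * ||u - v||^2) over all distinct embedding pairs.
    Lower values mean embeddings spread more evenly on the sphere."""
    embs = [l2_normalize(e) for e in embeddings]
    pots, n = [], len(embs)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(embs[i], embs[j]))
            pots.append(math.exp(-t * d2))
    return math.log(sum(pots) / len(pots))

def mutual_likelihood_score(mu1, var1, mu2, var2):
    """Assumed PFE-style mutual likelihood score between two Gaussian
    embeddings N(mu, diag(var)); higher means the two samples are more
    likely from the same speaker. The additive constant is omitted."""
    s = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        sv = v1 + v2
        s += (m1 - m2) ** 2 / sv + math.log(sv)
    return -0.5 * s
```

In training, the uniformity term would be added to the bootstrap objective to prevent representation collapse, and the mutual likelihood score would be maximized over same-speaker pairs to fit the variance (uncertainty) head.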
Purpose: The goal of this study was to build an accurate digital factory that evaluates the performance of a factory using computer simulation. To achieve this goal, we evaluated the effect of worker-related variables on production in a simulation model through a comparative analysis of two cases. Methods: The overall work process and worker-related variables were determined and used to build a simulation model in Siemens PLM Software's Plant Simulation. Two simulation models were built whose only difference was the use of the worker-related variables, and the total daily production was analyzed and compared for each individual process. Additionally, worker efficiency was evaluated based on worker analysis. Results: When the daily production of the two models was compared, a 0.16% error rate was observed for the model in which the worker-related variables were applied, whereas the error rate was approximately 5.35% for the model in which they were not. The production in the individual processes likewise showed a lower error rate in the model that included the worker-related variables. Also, among the 22 workers, only three satisfied the worker capacity rate (90%) suggested by the IFRS (International Financial Reporting Standards). Conclusions: For both daily total production and individual-process production, the model that included the worker-related variables produced results closer to the real production values. This result indicates the importance of worker elements as input variables when building accurate simulation models. As suggested in this study, the model that includes the worker-related variables can also be used to analyze actual production in more detail. The results of this study are expected to be utilized to improve the work process and worker efficiency.
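The comparison above can be illustrated outside Plant Simulation with a toy throughput model. This is a minimal Python sketch, not the study's actual model: the serial-line bottleneck logic, the single `worker_efficiency` factor, and all numbers are illustrative assumptions, standing in for the worker-related variables measured in the paper.

```python
def daily_throughput(cycle_times_s, working_hours=8.0, worker_efficiency=1.0):
    """Throughput of a serial line is set by its slowest (bottleneck)
    process; a worker-efficiency factor < 1 stretches every cycle time,
    mimicking the inclusion of worker-related variables."""
    bottleneck = max(ct / worker_efficiency for ct in cycle_times_s)
    return int(working_hours * 3600 // bottleneck)

def error_rate(simulated, actual):
    """Relative deviation of the simulated output from the observed one,
    as used to compare the two models against real production."""
    return abs(simulated - actual) / actual
```

For example, a three-process line with cycle times of 60, 45, and 30 seconds yields 480 units per 8-hour day at nominal efficiency, and fewer once a sub-100% worker efficiency is applied; comparing each figure to observed production gives the kind of per-model error rate reported in the Results.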
The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework that unravels speaker-related features from speaker-unrelated features via mutual information (MI) minimization. To minimize the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), which exploits an upper bound of MI. Our framework has a three-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on the Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate that our proposed framework is effective for disentanglement. Also, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder on the VoxCeleb datasets and then fine-tuned the speaker embedding model within the disentanglement framework on the FFSVC2022 dataset. The experimental results show that fine-tuning an existing pre-trained model with the disentanglement framework is valid and can further improve performance.
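The CLUB estimator at the heart of the framework can be sketched in a few lines. This is a minimal Python sketch under a simplifying assumption: the variational conditional q(y|x) is taken to be a fixed unit-variance Gaussian centered at x, whereas in CLUB-based training (and presumably in the paper) q is parameterized by a network and learned jointly.

```python
import math

def log_gaussian(y, mu, var=1.0):
    """log N(y; mu, var*I) for plain-list vectors y and mu."""
    d = len(y)
    sq = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return -0.5 * (sq / var + d * math.log(2 * math.pi * var))

def club_upper_bound(xs, ys, log_q=log_gaussian):
    """Sampled CLUB estimate of an MI upper bound:
    the mean of log q(y_i|x_i) over the N positive (paired) samples
    minus the mean of log q(y_j|x_i) over all N*N marginal pairs.
    Minimizing this pushes the paired embeddings toward independence."""
    n = len(xs)
    positive = sum(log_q(ys[i], xs[i]) for i in range(n)) / n
    negative = sum(log_q(ys[j], xs[i]) for i in range(n) for j in range(n)) / (n * n)
    return positive - negative
```

In the disentanglement stage, `xs` would be the speaker-related embeddings and `ys` the speaker-unrelated embeddings of the same batch, with the CLUB value added to the loss so that minimizing it drives their estimated MI toward zero.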
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing massive numbers of target keywords has been known to be essential for generalizing to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection and labeling, in this paper we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning, enabling target keywords to be detected using distance metrics. By exploiting a speech synthesis model that generates speech from pseudo phonemes instead of text, we easily obtain a large collection of multi-view samples with the same semantics. These samples are sufficient for training, considering that metric learning does not intrinsically necessitate labeled data. None of the components in our framework requires any supervision, making our method unsupervised. Experimental results on real datasets show that our proposed method is competitive even without any labeled or real data.
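The enrollment-then-detection step that metric learning enables can be sketched generically. This is a minimal Python sketch, not the paper's system: the prototype averaging, cosine similarity, and the 0.7 threshold are illustrative assumptions, and the embeddings would come from the trained metric-learning encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def enroll(embeddings):
    """Average the few enrollment embeddings into a keyword prototype."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]

def detect(query_embedding, prototype, threshold=0.7):
    """Flag a detection when the query is close enough to the prototype,
    so arbitrary keywords are handled by distance rather than by class."""
    return cosine(query_embedding, prototype) >= threshold
```

Because detection reduces to a distance comparison against an enrolled prototype, no keyword-specific classifier head is needed, which is what lets the system generalize to arbitrary target keywords from only a few enrollment samples.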