In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing speech waveforms with WaveNet by using acoustic features from an existing vocoder as auxiliary features of WaveNet. WaveNet is expected to learn a sample-by-sample correspondence between the speech waveform and the acoustic features. The advantage of the proposed method is that it requires neither (1) explicit modeling of excitation signals nor (2) various assumptions based on prior knowledge specific to speech. We conducted both subjective and objective evaluation experiments on the CMU-ARCTIC database. The objective evaluation demonstrated that the proposed method could generate high-quality speech while recovering the phase information that is lost by a mel-cepstrum vocoder. The subjective evaluation demonstrated that the sound quality of the proposed method was significantly better than that of the mel-cepstrum vocoder, and that the proposed method could capture source excitation information more accurately.
This paper presents a statistical voice conversion (VC) technique with WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, various factors such as F0 extraction errors, parameterization errors, and over-smoothing of the converted feature trajectory cause modeling errors in the speech waveform, which usually degrade the sound quality of the converted voice. To address this issue, we apply a direct waveform generation technique based on a WaveNet vocoder to VC. In the proposed method, the acoustic features of the source speaker are first converted into those of the target speaker based on the GMM. The waveform samples of the converted voice are then generated by the WaveNet vocoder conditioned on the converted acoustic features. To investigate the modeling accuracy of the converted speech waveform, we compare several types of acoustic features for training and synthesis with the WaveNet vocoder. The experimental results confirmed that the proposed VC technique achieves higher conversion accuracy on speaker individuality than the conventional VC technique, with comparable sound quality.
This paper describes an extension of separable lattice 2-D HMMs (SL-HMMs) using state duration models for image recognition. SL-HMMs are generative models that achieve size and location invariance through the state transitions of HMMs. However, the state duration probability of HMMs decreases exponentially with increasing duration, so it may not be appropriate for modeling image variations accurately. To overcome this problem, we employ the structure of hidden semi-Markov models (HSMMs), in which the state duration probability is explicitly modeled by parametric distributions. Face recognition experiments show that the proposed model improved the performance for images with size and location variations.
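The exponential decay mentioned in this abstract follows from the self-transition structure of an HMM: remaining in a state for d steps has probability a^(d-1)(1-a), where a is the self-transition probability. The sketch below contrasts that implicit distribution with an explicit duration model. The abstract does not say which parametric family is used, so the discretized Gaussian here, along with every numeric value, is an assumption chosen purely for illustration.

```python
import math


def hmm_duration_pmf(self_loop, max_d):
    """Implicit HMM duration: P(d) = a^(d-1) * (1 - a) for d = 1..max_d."""
    return [self_loop ** (d - 1) * (1.0 - self_loop) for d in range(1, max_d + 1)]


def gaussian_duration_pmf(mean, std, max_d):
    """Hypothetical explicit duration model (an assumption, not the paper's):
    Gaussian weights on d = 1..max_d, renormalized to sum to 1."""
    weights = [math.exp(-0.5 * ((d - mean) / std) ** 2) for d in range(1, max_d + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mode(pmf):
    """Most probable duration (durations start at 1)."""
    return max(range(len(pmf)), key=lambda i: pmf[i]) + 1


# Illustrative values only.
hmm = hmm_duration_pmf(self_loop=0.9, max_d=60)
hsmm = gaussian_duration_pmf(mean=10.0, std=3.0, max_d=60)

print("HMM most probable duration:", mode(hmm))
print("Explicit-duration most probable duration:", mode(hsmm))
```

Whatever self-transition probability is chosen, the implicit HMM distribution peaks at a duration of one step, whereas an explicit duration model can place its peak at any duration.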
SUMMARY: This paper proposes a Bayesian approach to image recognition based on separable lattice hidden Markov models (SL-HMMs). The geometric variations of the object to be recognized, e.g., size, location, and rotation, are an essential problem in image recognition. SL-HMMs, which have been proposed to reduce the effect of geometric variations, can perform elastic matching both horizontally and vertically. This makes it possible to model not only invariances to the size and location of the object but also nonlinear warping in both dimensions. The maximum likelihood (ML) method has been used in training SL-HMMs. However, in some image recognition tasks, it is difficult to acquire sufficient training data, and the ML method suffers from the over-fitting problem when there is insufficient training data. This study aims to accurately estimate SL-HMMs using the maximum a posteriori (MAP) and variational Bayesian (VB) methods. The MAP and VB methods can utilize prior distributions representing useful prior information, and the VB method is expected to obtain high generalization ability by marginalization of model parameters. Furthermore, to overcome the local maximum problem in the MAP and VB methods, the deterministic annealing expectation maximization algorithm is applied for training SL-HMMs. Face recognition experiments performed on the XM2VTS database indicated that the proposed method offers significantly improved image recognition performance. Additionally, comparative experiment results showed that the proposed method was more robust to geometric variations than convolutional neural networks.
Key words: image recognition, hidden Markov models, separable lattice hidden Markov models, Bayesian approach, deterministic annealing
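The effect of a prior when training data are scarce can be seen in a toy case far simpler than an SL-HMM: estimating the mean of a Gaussian with known variance. This is a minimal sketch, not the paper's training procedure. The conjugate Gaussian prior, its weight `tau`, and the data values are all assumptions chosen for illustration.

```python
def ml_mean(samples):
    """Maximum likelihood estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, prior_mean, tau):
    """MAP estimate of a Gaussian mean with known variance sigma^2 under the
    conjugate prior N(prior_mean, sigma^2 / tau); tau acts like a count of
    pseudo-observations located at prior_mean."""
    return (tau * prior_mean + sum(samples)) / (tau + len(samples))


# Hypothetical data: the same two values, seen once and then repeated 50 times.
few = [2.9, 3.4]
many = few * 50

for name, data in (("few", few), ("many", many)):
    ml = ml_mean(data)
    mp = map_mean(data, prior_mean=0.0, tau=2.0)
    print(f"{name}: ML={ml:.3f} MAP={mp:.3f}")
```

With only two samples the prior pulls the MAP estimate well away from the sample average. As the sample count grows, the two estimates converge, so the prior matters most in the low-data regime that the abstract describes.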
Our aim is to develop a smartphone-based life-logging system. Human activity recognition (HAR) is one of the core techniques needed to realize it. Recent studies have reported the effectiveness of feed-forward neural networks (FF-NNs) and recurrent neural networks (RNNs) as classifiers for the HAR task. However, those studies leave several problems unresolved: (1) a life-logging system that uses only a smartphone as the recording device has not been developed, (2) only indoor activities have been used for evaluation, and (3) RNNs have not been sufficiently investigated or evaluated. In this study, we address these problems as follows. (1) We build a prototype life-logging system and conduct a data-recording experiment on it that includes both indoor and outdoor activities. The experimental results of HAR on this new dataset showed that an RNN-based classifier was still effective. (2) The results of a HAR experiment demonstrated that a multi-layered Simple Recurrent Unit with a non-linear transform at the bottom layer and a highway connection was the most effective. (3) By observing the posterior probabilities over the test data, we identified the reason the RNN improves on the FF-NN.