In recent years, neural vocoders have surpassed classical speech generation approaches in the naturalness and perceptual quality of the synthesized speech. Computationally heavy models like WaveNet and WaveGlow achieve the best results, while lightweight GAN models, e.g., MelGAN and Parallel WaveGAN, remain inferior in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight neural vocoder that allows synthesis of high-fidelity speech with low computational complexity. StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a filter bank, with regularization provided by a multi-scale spectral reconstruction loss. The highly parallelizable speech generation is several times faster than real time on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
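The temporal adaptive normalization at the heart of the generator can be illustrated with a minimal numpy sketch: activations are normalized per channel over time and then modulated by time-varying scale and shift parameters derived from the acoustic features. The function name, shapes, and the random stand-in modulation values below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def temporal_adaptive_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel of x over time, then modulate it with
    time-varying scale (gamma) and shift (beta) computed from the
    conditioning acoustic features.
    x, gamma, beta: arrays of shape (channels, time)."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return gamma * x_norm + beta

# toy example: "style" a latent noise sequence with feature-derived
# modulation parameters (here just random placeholders)
rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 100))             # (channels, time)
gamma = 1.0 + 0.1 * rng.standard_normal((8, 100))
beta = 0.1 * rng.standard_normal((8, 100))
styled = temporal_adaptive_norm(noise, gamma, beta)
```

In the actual vocoder the modulation parameters are produced by learned convolutions over the mel-spectrogram; here they are random stand-ins to show the normalization-then-modulation mechanism only.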
A new system for finger-vein recognition is proposed based on the Complete Local Binary Pattern (CLBP) as a feature extractor and Phase-Only Correlation (POC) for post-processing alignment and for speeding up the system. The CLBP produces three image-descriptor components and thus retains more detail than previous methods such as the Local Binary Pattern (LBP), the Local Directional Pattern (LDP), the Local Line Binary Pattern (LLBP), the Repeated Line Tracking (RLT), the Maximum Curvature (MC) and the Wide Line Detector (WLD). In the proposed system, POC serves two purposes. First, to increase the performance of the system, the CLBP components of the test image are aligned with the enrolled CLBP components. Second, to speed up the matching stage, enrolled images that are highly misaligned with the test image are excluded from the Hamming Distance (HD) comparison. To make the system more secure against attacks targeting personal information, only the CLBP components are enrolled, and POC alignment is performed on these components without requiring the original images. For image pre-processing, a novel scheme is adopted that includes finger-vein localization, alignment, and Region-Of-Interest (ROI) extraction and enhancement. Two databases, UTFVP and SDUMLA-HMT, are used to evaluate the performance of the system. The results show that the Identification Recognition Rate (IRR) and the Equal Error Rate (EER) are, respectively, 99.66% and 0.139 for the UTFVP database, and 98.95% and 0.53% for the SDUMLA-HMT database. These results are competitive with those achieved by state-of-the-art systems.
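Phase-Only Correlation itself is a standard Fourier-domain technique and can be sketched in a few lines of numpy: the normalized cross-phase spectrum of two images is inverted, and the location of the resulting peak gives the translational misalignment between them. The function below is a generic illustration of the technique, not the paper's implementation.

```python
import numpy as np

def phase_only_correlation(f, g):
    """Phase-Only Correlation surface of two same-size images.
    The peak location gives the translational offset between them,
    and the peak height indicates how well they match."""
    F = np.fft.fft2(f)
    G = np.fft.fft2(g)
    cross = F * np.conj(G)
    cross /= np.abs(cross) + 1e-12          # keep phase only
    return np.real(np.fft.ifft2(cross))

# toy example: recover a known circular shift
rng = np.random.default_rng(1)
img = rng.standard_normal((64, 64))
shifted = np.roll(img, shift=(5, -3), axis=(0, 1))
poc = phase_only_correlation(shifted, img)
dy, dx = np.unravel_index(np.argmax(poc), poc.shape)
# dy, dx recover the shift modulo the image size: (5, 64 - 3 = 61)
```

Excluding highly misaligned enrollments, as the paper does, then amounts to thresholding the peak height or offset before running the HD comparison.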
In this paper, a system based on image descriptors and Local Histogram Concatenation (LHC) for finger-vein recognition is introduced. The LHC of image descriptors such as LBP, LDP and CLBP cannot be inverted back to the original images; therefore, they can provide good security if stored as enrolled data. On the other hand, the LHC technique does not encode spatial information, so it is expected to be less sensitive to image misalignment when a histogram-difference measure is used for recognition. The use of the histogram difference makes the system more robust to misalignment than pixel-by-pixel measures such as the Hamming Distance (HD). LHC is implemented by dividing the image descriptor into non-overlapping grid cells; the histogram within each cell is then computed and concatenated with the histograms of the preceding cells, and finally the concatenated histograms of two images are compared using the histogram-difference measure. Two datasets, UTFVP and SDUMLA-HMT, are used for testing the performance of the system. The results show that the Identification Recognition Rate (IRR) improves when LHCs of the image descriptors with the histogram-difference measure are used, compared to using only the image descriptors with the HD measure. For the UTFVP dataset, the IRR values were 97.44%, 95% and 98.37% when LHC with the histogram-difference measure was used with LBP, LDP and CLBP, respectively, while these values were 89.44%, 92.63% and 92.92% when only LBP, LDP and CLBP with HD were used. For the SDUMLA-HMT dataset, the IRR values were 98.43%, 98.69% and 98.85% when LHC with the histogram-difference measure was used with LBP, LDP and CLBP, respectively, while these values were 97.6%, 98.24% and 97.27% when only the image descriptors LBP, LDP and CLBP with HD were used.
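The grid-and-concatenate procedure can be sketched as follows. The grid size, bin count, and the sum-of-absolute-differences histogram measure are illustrative choices, since the abstract does not fix them.

```python
import numpy as np

def local_histogram_concatenation(desc, grid=(4, 4), bins=16):
    """Split a descriptor image into non-overlapping grid cells,
    compute a histogram inside each cell, and concatenate them
    into a single feature vector."""
    h, w = desc.shape
    gh, gw = h // grid[0], w // grid[1]
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = desc[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
            hists.append(hist / max(hist.sum(), 1))  # per-cell normalization
    return np.concatenate(hists)

def hist_difference(h1, h2):
    """Sum-of-absolute-differences histogram measure (one possible
    choice; the paper's exact measure may differ)."""
    return np.abs(h1 - h2).sum()

# toy descriptors: a random "LBP map" and a slightly shifted copy
rng = np.random.default_rng(2)
a = rng.integers(0, 256, size=(64, 192))
fa = local_histogram_concatenation(a)
fb = local_histogram_concatenation(np.roll(a, 2, axis=1))
```

Because each cell's histogram discards pixel positions within the cell, a small shift moves only a few pixels across cell borders, which is why histogram comparison tolerates misalignment better than a pixel-wise HD.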
This paper aims at improving the performance of a finger-vein recognition system using a new image-preprocessing scheme. The scheme includes three major steps: RGB-to-gray conversion, ROI extraction and alignment, and ROI enhancement. ROI extraction and alignment comprises four steps. First, the finger-vein boundaries are detected using two edge-detection masks, each of size 4 x 6. Second, finger rotation is corrected by fitting the finger baseline to the midpoints between the upper and lower boundaries using the least-squares method. Third, the ROI is extracted by cropping a rectangle around the center of the finger vein, which is determined using the first and second invariant moments. Fourth, the extracted ROI is normalized to a unified size of 192 x 64 to compensate for scale changes. ROI enhancement applies Contrast-Limited Adaptive Histogram Equalization (CLAHE), followed by median and modified Gaussian high-pass filters. Applying this preprocessing scheme to a finger-vein recognition system proved efficient with different feature extractors and different finger-vein databases. For the University of Twente Finger Vascular Pattern (UTFVP) database, the achieved Identification Recognition Rates (IRR) in identification mode using the three feature extraction methods Local Binary Pattern (LBP), Local Directional Pattern (LDP) and Local Line Binary Pattern (LLBP) are 99.79%, 99.86% and 99.86%, respectively, while the achieved Equal Error Rates (EER) in verification mode for the same methods are 0.139, 0.069 and 0.035.
For the Shandong University Machine Learning and Applications - Homologous Multi-modal Traits (SDUMLA-HMT) database, the achieved IRRs in identification mode using LBP, LDP and LLBP are 99.57%, 99.73% and 99.65%, respectively, while the achieved EERs in verification mode for the same methods are 0.419, 0.262 and 0.341. These results surpass those of previous state-of-the-art methods.
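The rotation-correction step above (fitting a baseline through the boundary midpoints by least squares and taking its slope as the rotation angle) can be sketched as below; the function name and the use of `np.polyfit` are illustrative assumptions.

```python
import numpy as np

def finger_rotation_angle(top, bottom):
    """Estimate in-plane finger rotation from the detected boundaries.
    top, bottom: y-coordinates of the upper/lower finger edges per
    column. A least-squares line is fitted through the midpoints;
    its slope gives the rotation angle (degrees) to correct."""
    mid = (np.asarray(top) + np.asarray(bottom)) / 2.0
    x = np.arange(mid.size)
    slope, _ = np.polyfit(x, mid, 1)      # degree-1 least-squares fit
    return np.degrees(np.arctan(slope))

# toy example: boundaries of a finger tilted by ~2 degrees
x = np.arange(200)
tilt = np.tan(np.radians(2.0)) * x
angle = finger_rotation_angle(20 + tilt, 60 + tilt)
# angle is approximately 2.0 degrees
```

The image would then be rotated by `-angle` (e.g. with an affine warp) before cropping the ROI, so that all fingers share a common orientation.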
Classical parametric speech coding techniques provide a compact representation of speech signals. This affords a very low transmission rate, but at a reduced perceptual quality of the reconstructed signals. Recently, autoregressive deep generative models such as WaveNet and SampleRNN have been used as speech vocoders to scale up the perceptual quality of the reconstructed signals without increasing the coding rate. However, such models suffer from very slow signal generation due to their sample-by-sample modelling approach. In this work, we introduce a new methodology for neural speech vocoding based on generative adversarial networks (GANs). A fake speech signal is generated from a highly compressed representation of the glottal excitation using conditional GANs as a deep generative model. This fake speech is then refined using the LPC parameters of the original speech signal to obtain a natural reconstruction. Speech waveforms reconstructed with this approach show higher perceptual quality than their classical vocoder counterparts according to subjective and objective evaluation scores on a dataset of 30 male and female speakers. Moreover, GANs generate signals in one shot, in contrast to autoregressive generative models, which makes them promising for building high-quality neural vocoders.
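The LPC-based refinement builds on standard all-pole synthesis filtering: an excitation signal is passed through the filter defined by the LPC coefficients to impose the spectral envelope. A minimal sketch of that filter (not the paper's exact pipeline, where the excitation comes from the GAN) is:

```python
import numpy as np

def lpc_synthesis(excitation, a):
    """All-pole LPC synthesis filter:
        s[n] = e[n] - sum_k a[k] * s[n - k]
    where `a` holds the LPC coefficients (without the leading 1)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        s[n] = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s[n] -= ak * s[n - k]
    return s

# toy example: a single pole at z = 0.9 turns an impulse into a
# decaying exponential 0.9**n
e = np.zeros(8)
e[0] = 1.0
s = lpc_synthesis(e, [-0.9])
```

In practice the coefficients are estimated per frame and the filtering is done frame by frame, but the per-sample recursion is the same.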