A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of traditional GMM training and supervector encoding on top of a CNN. The proposed layer can accumulate high-order statistics from a variable-length input sequence and generate an utterance-level, fixed-dimensional vector representation. Unlike conventional methods, our new approach provides an end-to-end learning framework in which the inherent dictionary is learned directly from the loss function, so the dictionary and the encoding representation for the classifier are learned jointly. The representation is orderless and therefore appropriate for language identification. We conducted a preliminary experiment on the NIST LRE07 closed-set task, and the results reveal that our proposed dictionary encoding layer achieves a significant error reduction compared with simple average pooling.
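The pooling mechanism described above — soft-assigning each frame to learnable dictionary components and accumulating residual statistics, much like GMM supervector encoding — can be sketched as a forward pass. This is a minimal illustration under assumed shapes, not the authors' exact implementation; in practice the dictionary and smoothing factors would be trained by backpropagation.

```python
import numpy as np

def lde_pool(frames, dictionary, smoothing):
    """Learnable Dictionary Encoding (LDE) pooling, forward pass only (a sketch).

    frames:     (T, D) variable-length sequence of frame-level features
    dictionary: (C, D) learnable component centers (the "dictionary")
    smoothing:  (C,)   learnable per-component smoothing factors

    Returns a fixed (C*D,) utterance-level vector built from soft-assigned
    residuals, analogous to the statistics accumulated in GMM training.
    """
    # residuals r_tc = x_t - mu_c, shape (T, C, D)
    residuals = frames[:, None, :] - dictionary[None, :, :]
    # soft-assignment weights: softmax over components of scaled negative distances
    logits = -smoothing[None, :] * np.sum(residuals ** 2, axis=2)  # (T, C)
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # aggregate weighted residuals over time, then flatten per-component vectors
    encoded = np.einsum('tc,tcd->cd', weights, residuals) / len(frames)
    return encoded.ravel()
```

Because the time axis is summed out, utterances of any length map to the same fixed dimensionality, and the result is invariant to frame order — the "orderless" property noted in the abstract.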
In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a system with feedback constraints for multi-speaker speech synthesis. We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network. The constraint is imposed through an added loss on speaker identity, designed to improve the speaker similarity between the synthesized speech and its natural reference audio. The model is trained and evaluated on publicly available datasets. Experimental results, including visualizations of the speaker embedding space, show a significant improvement in speaker identity cloning at the spectrogram level. In addition, synthesized samples are available online for listening.
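The feedback constraint amounts to passing both the synthesized and the reference audio through the speaker verification network and penalizing embedding dissimilarity. A minimal sketch of such a loss, assuming a cosine-similarity formulation (the function name and exact form are illustrative, not taken from the paper):

```python
import numpy as np

def speaker_feedback_loss(emb_synth, emb_ref, eps=1e-8):
    """Illustrative speaker-identity feedback loss (a sketch, not the paper's
    exact formulation): 1 - cosine similarity between the speaker-verification
    embeddings of the synthesized utterance and its natural reference audio.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x) + eps)
    cos_sim = float(np.dot(l2norm(emb_synth), l2norm(emb_ref)))
    # loss is 0 when the embeddings point the same way, up to 2 when opposed
    return 1.0 - cos_sim
```

During training this term would be added, with some weight, to the usual spectrogram reconstruction loss, with gradients flowing back into the synthesis network.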
A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of the front-end CNN, so that it can automatically encode the variable-length input sequence into an utterance-level vector. After comparing with state-of-the-art GMM i-vector methods, we give insights into the CNN and reveal its role and effect in the whole pipeline. We further introduce the general encoding layer, illustrating why it is appropriate for language identification. We elaborate on several typical encoding layers, including a temporal average pooling layer, a recurrent encoding layer, and a novel learnable dictionary encoding layer. We conducted experiments on the NIST LRE07 closed-set task, and the results show that our proposed end-to-end systems achieve state-of-the-art performance.
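The simplest encoding layer mentioned above, temporal average pooling, just averages the CNN's frame-level outputs over time — which already yields a fixed-dimensional utterance vector from any input length. A minimal sketch:

```python
import numpy as np

def temporal_average_pooling(frames):
    """Temporal average pooling: collapse a (T, D) sequence of frame-level
    features into a single (D,) utterance-level vector by averaging over time.
    The output dimension is independent of the sequence length T.
    """
    return np.mean(frames, axis=0)
```

The richer encoding layers (recurrent, learnable dictionary) replace this mean with parameterized, trainable aggregations, but all serve the same structural role of mapping variable-length sequences to fixed-size vectors.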
This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network acting as a sentence encoder to accumulate context correlations, followed by a prediction network that maps the polyphonic character embeddings, along with the conditions, to the corresponding pronunciations. We obtain the word-level condition from a pre-trained word-to-vector lookup table. One goal of polyphone disambiguation is to address the homograph problem in the front-end processing of Mandarin Chinese text-to-speech systems. Our system achieves an accuracy of 94.69% on a publicly available polyphonic character dataset. To further validate our choice of conditional feature, we investigate polyphone disambiguation systems with conditions at multiple levels. The experimental results show that both sentence-level and word-level conditional embedding features attain good performance for Mandarin Chinese polyphone disambiguation.
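The prediction step can be sketched as follows: the polyphonic character's embedding is concatenated with its word-level condition vector and passed through a classifier over candidate pronunciations. This is a hedged illustration with an assumed single linear layer and made-up parameter names (`W`, `b`); the paper's prediction network is a trained neural component, not this toy classifier.

```python
import numpy as np

def predict_pronunciation(char_embedding, word_condition, W, b):
    """Illustrative prediction head: concatenate the character embedding with
    the word-level condition and apply a softmax over candidate pronunciations.
    W has shape (num_pronunciations, char_dim + cond_dim), b shape
    (num_pronunciations,); both stand in for learned parameters.
    """
    x = np.concatenate([char_embedding, word_condition])
    logits = W @ x + b
    z = logits - logits.max()          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.argmax(probs)), probs
```

In the full system, `char_embedding` would come from the BiRNN sentence encoder and `word_condition` from the pre-trained word-to-vector lookup table.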
In this paper, we apply the NetFV and NetVLAD layers to the end-to-end language identification task. The NetFV and NetVLAD layers are differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both can encode a sequence of feature vectors into a fixed-dimensional vector, which is essential for processing variable-length utterances. We first present the connections and differences between the classical i-vector and the aforementioned encoding schemes. Then, we construct a flexible end-to-end framework comprising a convolutional neural network (CNN) architecture and an encoding layer (NetFV or NetVLAD) for the language identification task. Experimental results on the NIST LRE 2007 closed-set task show that the proposed system achieves significant EER reductions against the conventional i-vector baseline and the CNN temporal average pooling system, respectively.
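NetVLAD's core operation — soft-assigning each feature to learnable cluster centers and summing assignment-weighted residuals — can be sketched as a forward pass. This is an illustration of the standard NetVLAD formulation under assumed shapes, not the authors' exact implementation (in the original the assignment is a 1x1 convolution followed by a softmax, trained end-to-end):

```python
import numpy as np

def netvlad_pool(features, centers, assign_W, assign_b, eps=1e-8):
    """NetVLAD-style pooling, forward pass only (a sketch).

    features: (T, D) frame-level features; centers: (K, D) learnable cluster
    centers; assign_W: (K, D) and assign_b: (K,) parameterize the learnable
    soft-assignment. Returns a fixed (K*D,) L2-normalized descriptor.
    """
    # learnable soft assignment of each frame to the K clusters
    logits = features @ assign_W.T + assign_b        # (T, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)
    # accumulate assignment-weighted residuals per cluster
    residuals = features[:, None, :] - centers[None, :, :]   # (T, K, D)
    vlad = np.einsum('tk,tkd->kd', a, residuals)             # (K, D)
    # intra-normalization per cluster, then global L2 normalization
    vlad /= (np.linalg.norm(vlad, axis=1, keepdims=True) + eps)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + eps)
```

NetFV follows the same pattern but accumulates both first- and second-order statistics with respect to a learnable Gaussian mixture, mirroring the Fisher Vector.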
CdTe nanocrystal (NC) solar cells have received much attention in recent years due to their low cost and environmentally friendly fabrication process. The back contact remains the key issue for further improving device performance. It is well known that, for CdTe thin-film solar cells prepared with the close-spaced sublimation (CSS) method, Cu-doped CdTe can drastically decrease the series resistance and yield high device performance. However, there are still few reports on solution-processed CdTe NC solar cells with a Cu-doped back contact. In this work, a ZnTe:Cu or Cu:Au back contact layer (buffer layer) was deposited on the CdTe NC thin film by thermal evaporation, and devices with the inverted structure ITO/ZnO/CdSe/CdTe/ZnTe:Cu (or Cu)/Au were fabricated and investigated. It was found that, compared with an Au or Cu:Au device, incorporating ZnTe:Cu as a back contact layer improves the open-circuit voltage (Voc) and fill factor (FF) due to an optimized band alignment, resulting in enhanced power conversion efficiency (PCE). By carefully optimizing the treatment of the ZnTe:Cu film (varying the film thickness and annealing temperature), an excellent PCE of 6.38% was obtained, a 21.06% improvement over a device without the ZnTe:Cu layer (device structure ITO/ZnO/CdSe/CdTe/Au).