Interspeech 2021
DOI: 10.21437/interspeech.2021-277
Modeling and Training Strategies for Language Recognition Systems

Abstract: Automatic speech recognition is complementary to language recognition. Language recognition systems exploit this complementarity by using frame-level bottleneck features extracted from neural networks trained on a phone recognition task. Recent methods instead extract frame-level bottleneck features from an end-to-end sequence-to-sequence speech recognition model. In this work, we study an integrated approach to training the speech recognition feature extractor and the language recognition modules. We…

Cited by 8 publications (6 citation statements) | References 32 publications
“…1 in Table 4. We get the same conclusion as [16], that is, finetuning the LID task with the unfrozen encoder outperforms that with the frozen encoder. Note that although No.…”
Section: Experimental Results Under Different Training Strategies (supporting, confidence: 70%)
“…It can be observed that these methods differ in whether the ASR encoder is frozen or not during the second stage of training. It is shown in [16] that the unfrozen encoder is superior in the recognition accuracy. In our preliminary experiments, we tried these two training strategies mentioned above and extracted fixed-length embeddings from some cross channel test data.…”
Section: A Trade-off Between the Recognition Accuracy And The General... (mentioning, confidence: 99%)
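The frozen/unfrozen distinction these excerpts compare can be sketched in a few lines of PyTorch. The model below is a minimal stand-in, not the architecture from the cited papers: a hypothetical `encoder` (representing a pretrained ASR encoder) feeding a LID classification `head`, with a helper that toggles which part is trainable during the second-stage finetuning.

```python
import torch
import torch.nn as nn

class LidModel(nn.Module):
    """Toy stand-in: pretrained ASR encoder + language-ID head."""
    def __init__(self, feat_dim=80, hidden=256, n_langs=10):
        super().__init__()
        # GRU stands in for the pretrained ASR encoder.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # Linear layer stands in for the LID classifier head.
        self.head = nn.Linear(hidden, n_langs)

    def forward(self, x):
        out, _ = self.encoder(x)
        # Mean-pool frame-level outputs, then classify the utterance.
        return self.head(out.mean(dim=1))

def set_encoder_frozen(model: LidModel, frozen: bool) -> None:
    """Frozen: only the LID head trains; unfrozen: the encoder finetunes too."""
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

model = LidModel()
set_encoder_frozen(model, frozen=True)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the head's weight and bias remain trainable
```

With `frozen=False`, the optimizer also updates the encoder, which is the strategy the excerpts report as giving better recognition accuracy.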
“…Bottleneck features. An acoustic model trained for ASR can also be used for other tasks which rely on the phonetic content but do not require a word-level transcription, such as language identification [63] or keyword spotting [77]. In such cases, instead of using the acoustic model output (triphone posterior probabilities), a sequence of phonetic features called bottleneck (BN) features is extracted from an intermediate layer of the acoustic model [81] and used, possibly in combination with other features, as input to these tasks.…”
Section: Speech Processing (mentioning, confidence: 99%)
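The bottleneck-feature idea described above can be illustrated with a forward hook that taps an intermediate layer instead of the model's triphone-posterior output. The toy acoustic model, layer sizes, and index below are illustrative assumptions, not taken from the cited works:

```python
import torch
import torch.nn as nn

# Toy acoustic model: 40-dim acoustic frames -> 2000 triphone posterior logits,
# with a narrow 64-dim "bottleneck" layer in the middle.
acoustic_model = nn.Sequential(
    nn.Linear(40, 512), nn.ReLU(),
    nn.Linear(512, 64), nn.ReLU(),   # index 2 is the bottleneck layer
    nn.Linear(64, 2000),             # ASR output layer (unused downstream)
)

features = {}
def save_bn(module, inputs, output):
    # Capture the bottleneck activations as they flow through the model.
    features["bn"] = output.detach()

# Tap the bottleneck layer rather than the final posterior output.
acoustic_model[2].register_forward_hook(save_bn)

frames = torch.randn(100, 40)   # 100 frames of 40-dim acoustic features
_ = acoustic_model(frames)      # BN features are captured as a side effect
print(features["bn"].shape)     # frame-level 64-dim BN features: (100, 64)
```

The captured frame-level BN features would then feed the downstream task (e.g., a language-identification classifier), possibly concatenated with other features.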