Acoustic and Textual Data Augmentation for Improved ASR of Code-Switching Speech

Yılmaz, Emre; Heuvel, H. van den; Leeuwen, David A. van

doi:10.21437/interspeech.2018-52

Cited by 41 publications

(44 citation statements)

References 25 publications

“…CS ASR employs a bilingual acoustic model that captures the phonetic characteristics of both languages and a bilingual language model (LM) which can assign probabilities to code-mixed word sequences as well as monolingual word sequences from both languages. The current system uses data-augmented models described in [24]. The acoustic model is trained on automatically transcribed data from the same archive and a large amount of monolingual data from the high-resourced language (Dutch) together with the manually transcribed data from the FAME!…”

Section: Baseline Approach: Time Alignment of CS ASR Output (mentioning)

confidence: 99%

“…training data is the only source of CS text and contains 140k words. The remaining CS text is automatically generated as described in [24].…”

Section: Speech and Text Data (mentioning)

confidence: 99%

“…The LM used for the baseline CS detection system is a standard bilingual 3-gram with interpolated Kneser-Ney smoothing. Further details are provided in [24]. We compute phone posteriors from the denominator graph (created using a phone LM estimated from the phone alignments of the training data) of the chain model and map them to phones using the existing implementation in Kaldi (nnet3-chain-compute-post).…”

Section: Implementation Details (mentioning)

confidence: 99%

Code-Switching Detection Using ASR-Generated Language Posteriors

Wang, Yılmaz, Derinel et al. 2019

Interspeech 2019

Self Cite

Code-switching (CS) detection refers to the automatic detection of language switches in code-mixed utterances. This task can be achieved by using a CS automatic speech recognition (ASR) system that can handle such language switches. In our previous work, we investigated the code-switching detection performance of the Frisian-Dutch CS ASR system by using the time alignment of the most likely hypothesis and found that this technique suffers from over-switching due to numerous very short spurious language switches. In this paper, we propose a novel method for CS detection that aims to remedy this shortcoming by using language posteriors, which are the sum of the frame-level posteriors of phones belonging to the same language. The CS ASR-generated language posteriors contain more complete language-specific information at the frame level than the time alignment of the ASR output. Hence, they are expected to yield more accurate and robust CS detection. The CS detection experiments demonstrate that the proposed language posterior-based approach provides higher detection accuracy than the baseline system in terms of equal error rate. Moreover, a detailed CS detection error analysis reveals that using language posteriors reduces false alarms and results in more robust CS detection.
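The language posterior described in this abstract (the sum of frame-level phone posteriors over the phones of one language) can be illustrated with a small sketch. This is not the authors' implementation: the language-tagged phone names, the toy posterior values, and the per-frame argmax decision are all assumptions made for illustration, and the paper's actual detection rule is not given in the text above.

```python
from typing import Dict, List


def language_posteriors(
    frame_phone_post: List[Dict[str, float]],
    phone_to_lang: Dict[str, str],
) -> List[Dict[str, float]]:
    """For each frame, sum the phone posteriors belonging to each language."""
    languages = sorted(set(phone_to_lang.values()))
    result = []
    for frame in frame_phone_post:
        totals = {lang: 0.0 for lang in languages}
        for phone, post in frame.items():
            lang = phone_to_lang.get(phone)
            if lang is not None:  # phones without a language (e.g. silence) are skipped
                totals[lang] += post
        result.append(totals)
    return result


# Hypothetical language-tagged phone inventory (assumption, not from the paper).
phone_to_lang = {
    "a_fr": "frisian", "k_fr": "frisian",
    "a_nl": "dutch", "k_nl": "dutch",
}

# Toy frame-level phone posteriors for two frames (made-up values).
frames = [
    {"a_fr": 0.6, "k_fr": 0.1, "a_nl": 0.2, "k_nl": 0.1},
    {"a_fr": 0.1, "k_fr": 0.1, "a_nl": 0.5, "k_nl": 0.3},
]

lang_post = language_posteriors(frames, phone_to_lang)

# Assumed decision rule: pick the language with the highest posterior per frame.
labels = [max(p, key=p.get) for p in lang_post]
```

In this toy input the first frame leans towards the Frisian phones and the second towards the Dutch phones, so a change in the per-frame label marks a candidate switch point.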

“…Project, we have developed a spoken document retrieval system for the radio broadcast archives of Omrop Fryslân (Frisian Broadcast), the regional public broadcaster of the province Fryslân in the Netherlands. This system relies on automatically generated transcriptions hypothesized by a code-switching automatic speech recognition system [16] and speaker labels generated by a modern speaker recognition system developed using the resources [17] with the ultimate goal of making these archives searchable.…”

Section: Introduction (mentioning)

confidence: 99%

Large-Scale Speaker Diarization of Radio Broadcast Archives

Yılmaz, Derinel, Zhou et al. 2019

Interspeech 2019

Self Cite

This paper describes our initial efforts to build a large-scale speaker diarization (SD) and identification system on a recently digitized radio broadcast archive from the Netherlands which has more than 6500 audio tapes with 3000 hours of Frisian-Dutch speech recorded between 1950 and 2016. The employed large-scale diarization scheme involves two stages: (1) tape-level speaker diarization providing pseudo-speaker identities and (2) speaker linking to relate pseudo-speakers appearing in multiple tapes. Having access to the speaker models of several frequently appearing speakers from the previously collected FAME! speech corpus, we further perform speaker identification by linking these known speakers to the pseudo-speakers identified at the first stage. In this work, we present a recently created longitudinal and multilingual SD corpus designed for large-scale SD research and evaluate the performance of a new speaker linking system using x-vectors with PLDA to quantify cross-tape speaker similarity on this corpus. The performance of this speaker linking system is evaluated on a small subset of the archive which is manually annotated with speaker information. The speaker linking performance reported on this subset (53 hours) and the whole archive (3000 hours) is compared to quantify the impact of scaling up the amount of speech data.

“…The upper panel summarizes the number of words from each language subset. The middle panel provides the results of state-of-the-art ANN architectures (Yılmaz et al., 2018) for reference purposes and the lower panel presents the results achieved by the ANN and SNN models in this work (AM: acoustic model).…”

mentioning

confidence: 99%

Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition

Yılmaz, Zhang et al. 2020

Front. Neurosci.

Self Cite

Artificial neural networks (ANN) have become the mainstream acoustic modeling technique for large vocabulary automatic speech recognition (ASR). A conventional ANN features a multi-layer architecture that requires massive amounts of computation. The brain-inspired spiking neural networks (SNN) closely mimic biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation. Motivated by their unprecedented energy efficiency and rapid information processing capability, we explore the use of SNNs for speech recognition. In this work, we use SNNs for acoustic modeling and evaluate their performance on several large vocabulary recognition scenarios. The experimental results demonstrate ASR accuracies competitive with their ANN counterparts, while requiring significantly reduced computational cost and inference time. Integrating the algorithmic power of deep SNNs with energy-efficient neuromorphic hardware therefore offers an attractive solution for ASR applications running locally on mobile and embedded devices.
