ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414741

Lattice-Free MMI Adaptation of Self-Supervised Pretrained Acoustic Models

Abstract: In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of self-supervised pretrained acoustic models. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that, by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on …
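The abstract only sketches the recipe (self-supervised pretraining of a Transformer encoder, then supervised LFMMI adaptation), so the following minimal sketch illustrates what such an adaptation loop could look like. It is not the paper's implementation: the encoder checkpoint name ("facebook/wav2vec2-base"), the output dimensionality NUM_PDF_IDS, and the lfmmi_loss helper are illustrative placeholders; in practice the LF-MMI objective comes from a dedicated toolkit (e.g., pychain or k2) together with numerator/denominator graphs.

```python
# Minimal sketch (assumptions noted above): supervised LF-MMI adaptation of a
# self-supervised pretrained acoustic encoder.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # any self-supervised Transformer encoder works here

NUM_PDF_IDS = 2328  # assumed output-layer size (number of tied context-dependent states)


class LFMMIAdaptedModel(nn.Module):
    def __init__(self, pretrained_name="facebook/wav2vec2-base"):
        super().__init__()
        # Encoder pretrained on untranscribed audio; fine-tuned during adaptation.
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained_name)
        self.output = nn.Linear(self.encoder.config.hidden_size, NUM_PDF_IDS)

    def forward(self, waveforms):
        feats = self.encoder(waveforms).last_hidden_state      # (batch, frames, hidden)
        return torch.log_softmax(self.output(feats), dim=-1)   # frame-level log-probabilities


def lfmmi_loss(log_probs, numerator_graphs, denominator_graph):
    """Hypothetical placeholder: LF-MMI = numerator (supervision) forward score
    minus denominator (phone-LM) forward score, normally provided by a toolkit."""
    raise NotImplementedError


model = LFMMIAdaptedModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for adaptation

# Adaptation loop over labelled data (schematic):
# for waveforms, num_graphs in dataloader:
#     loss = lfmmi_loss(model(waveforms), num_graphs, den_graph)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```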

Cited by 8 publications (5 citation statements)
References 16 publications
“…Second, E2E systems model AM and LM jointly, and they are mostly trained with connectionist temporal classification (CTC) loss [34] (enabling alignment-free training). In [35], CTC and LF-MMI adaptation of pre-trained models are compared. Recently, attention-based models (e.g., Transformers) have become the de facto choice for AM [4,10,36].…”
Section: Related Work
Mentioning confidence: 99%
“…One such model is the XLSR [41], which can then be fine-tuned on ATC data. The authors of [42] proposed to use the LF-MMI criterion (similar to hybrid-based ASR) for the supervised adaptation of the self-supervised pretrained XLSR model [41]. We employed this technique to fine-tune the pre-trained model on our annotated ATC data.…”
Section: Automatic Speech Recognition
Mentioning confidence: 99%
“…Second, E2E systems model AM and LM jointly, and they are mostly trained with connectionist temporal classification (CTC) loss [25] (enabling alignment-free training). [26] compares CTC and LF-MMI adaptation of pre-trained models. Recently, attention-based models (e.g., Transformers) have become the de facto choice for AM [4,27,9].…”
Section: Related Work
Mentioning confidence: 99%