Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition

Kocour, Martin; Veselý, Karel; Blatt, Alexander; Gomez, Juan Zuluaga; Szöke, Igor; Černocký, Jan; Klakow, Dietrich; Motlíček, Petr

doi:10.21437/interspeech.2021-1619

Cited by 11 publications

(12 citation statements)

References 0 publications


“…In [11] unlabeled ATC speech is employed in semi-supervised learning to decrease word error rates. Boosting of contextual knowledge during and after decoding has also been explored in [21,22,23].…”

Section: Related Work (mentioning; confidence: 99%)
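Boosting contextual knowledge can be illustrated with a toy example. The sketch below re-ranks an n-best list by adding a fixed bonus whenever a hypothesis contains an expected context phrase (for example, a call-sign known to be in the airspace). It is not the method of the cited works, which boost in the Kaldi HCLG graph and in lattices; the `Hypothesis` class, `boost_nbest`, the `bonus_per_word` value and the example phrases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list[str]
    score: float  # decoder log-score; higher is better


def boost_nbest(nbest, context_phrases, bonus_per_word=2.0):
    """Re-rank an n-best list, rewarding every occurrence of a context phrase.

    `bonus_per_word` is a hypothetical tuning constant, not a value from the paper.
    """
    rescored = []
    for hyp in nbest:
        bonus = 0.0
        for phrase in context_phrases:
            n = len(phrase)
            # Add a bonus for each position where the phrase matches the word sequence.
            for i in range(len(hyp.words) - n + 1):
                if hyp.words[i:i + n] == phrase:
                    bonus += bonus_per_word * n
        rescored.append(Hypothesis(hyp.words, hyp.score + bonus))
    return sorted(rescored, key=lambda h: h.score, reverse=True)


# Hypothetical example: the call-sign "lufthansa one two three" is expected from context.
nbest = [
    Hypothesis("lufthansa one to three climb".split(), -10.0),
    Hypothesis("lufthansa one two three climb".split(), -11.0),
]
context = [["lufthansa", "one", "two", "three"]]
best = boost_nbest(nbest, context)[0]
print(" ".join(best.words))  # lufthansa one two three climb
```

The bonus is proportional to phrase length so that longer, more specific context matches weigh more; in a real decoder this trade-off would be tuned on development data.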


How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

Zuluaga-Gómez, Prasad, Nigmatulina et al. 2022

Preprint

Self Cite


Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data differs substantially between the pre-training and downstream fine-tuning phases (i.e., domain shift). We target this scenario by analyzing the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for a completely unseen domain, i.e., air traffic control (ATC) communications. We benchmark the proposed models on four challenging ATC test sets (the signal-to-noise ratio varies between 5 and 20 dB). Relative word error rate (WER) reductions of 20% to 40% are obtained in comparison to hybrid-based state-of-the-art ASR baselines by fine-tuning E2E acoustic models with a small fraction of labeled data. We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
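The "relative WER reduction" quoted above compares a new system against a baseline as (WER_baseline − WER_new) / WER_baseline. A minimal sketch of both quantities follows, assuming the standard word-level Levenshtein definition of WER; the function names and the example sentences and numbers are illustrative only, not results from the paper.

```python
def word_errors(ref: str, hyp: str) -> int:
    """Minimum number of word substitutions, deletions and insertions (Levenshtein)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances from an empty reference prefix
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # substitution or match
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1)      # insertion
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of a new system with respect to a baseline."""
    return (baseline_wer - new_wer) / baseline_wer


print(word_errors("climb flight level three", "climb level tree"))  # 2
print(relative_reduction(0.20, 0.15))  # about 0.25, i.e. a 25% relative reduction
```

Corpus-level WER sums errors before dividing, so long utterances weigh more than short ones; averaging per-utterance WERs would give a different number.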


Section: Related Work (mentioning; confidence: 99%)

“…ATCO2-Test: development and evaluation set available as open source and presented at Interspeech 2021 [11,21]. The data consists of ATC communications from different airports located in Australia, the Czech Republic, Slovakia, and Switzerland (see the ATCO2 website³).…”

Section: Datasets and Experimental Setup (mentioning; confidence: 99%)

How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications


“…Various works have already investigated context incorporation in ASR [5,6,7], which is the preceding step in the ATC speech processing pipeline. Two other works of the ATCO2 project [8,9] show that combining HCLG and lattice boosting using Kaldi [10] reduces ATC-ASR errors, especially for call-signs. We build on top of these works by extracting the (erroneous) call-signs from the ASR transcripts and mapping them to the standardized ICAO format.…”

Section: Related Work (mentioning; confidence: 99%)
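Mapping a spoken call-sign to the standardized ICAO format means replacing the airline's telephony designator with its three-letter ICAO code and the spoken digits and spelling-alphabet words with characters. The rule-based sketch below assumes the call-sign starts the utterance and uses a tiny hypothetical designator table; the cited work instead trains a recognizer on noisy transcripts, which this does not reproduce.

```python
# Illustrative subset of ICAO telephony designators; a real system needs the full list.
TELEPHONY_TO_ICAO = {"lufthansa": "DLH", "speedbird": "BAW", "swiss": "SWR"}

# Spoken digits, including ICAO radiotelephony pronunciations ("tree", "fife", "niner").
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
          "four": "4", "five": "5", "fife": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9", "niner": "9"}

# ICAO spelling alphabet: each code word maps to its initial letter.
LETTERS = {w: w[0].upper() for w in (
    "alfa alpha bravo charlie delta echo foxtrot golf hotel india juliett juliet "
    "kilo lima mike november oscar papa quebec romeo sierra tango uniform victor "
    "whiskey xray yankee zulu").split()}


def to_icao(words):
    """Convert a spoken call-sign at the start of `words` to ICAO format, or None."""
    if not words or words[0] not in TELEPHONY_TO_ICAO:
        return None
    suffix = []
    for w in words[1:]:
        if w in DIGITS:
            suffix.append(DIGITS[w])
        elif w in LETTERS:
            suffix.append(LETTERS[w])
        else:
            break  # the call-sign ends at the first word that is not a digit or letter
    return TELEPHONY_TO_ICAO[words[0]] + "".join(suffix) if suffix else None


print(to_icao("lufthansa one two tree alfa climb flight level".split()))  # DLH123A
```

Hand-written rules like these break down on erroneous transcripts (a misrecognized airline name yields None), which is the motivation for a trained recognizer.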

Call-Sign Recognition and Understanding for Noisy Air-Traffic Transcripts Using Surveillance Information

Blatt, Kocour, Veselý et al. 2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Air traffic control (ATC) relies on spoken communication between pilots and air-traffic controllers (ATCOs). The call-sign, as the unique identifier of each flight, is used by the ATCO to address a specific pilot. Extracting the call-sign from the communication is challenging because of the noisy ATC voice channel and the additional noise introduced by the receiver. A low signal-to-noise ratio (SNR) in the speech leads to transcripts with a high word error rate (WER). We propose a new call-sign recognition and understanding (CRU) system that addresses this issue. The recognizer is trained to identify call-signs in noisy ATC transcripts and convert them into the standard International Civil Aviation Organization (ICAO) format. By incorporating surveillance information, we can increase the call-sign accuracy (CSA) by up to a factor of four. The introduced data augmentation further improves performance on high-WER transcripts and allows the model to be adapted to unseen airspaces.
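One simple way surveillance information can help is to restrict the output to call-signs that are actually present in the airspace. The sketch below snaps a possibly misrecognized ICAO call-sign to the most similar call-sign in a surveillance list using Python's `difflib` string similarity. This is a hypothetical stand-in, not the CRU system's mechanism; `snap_to_surveillance`, the `cutoff` value and the call-sign list are assumptions for illustration.

```python
import difflib


def snap_to_surveillance(predicted, airborne_callsigns, cutoff=0.6):
    """Replace a (possibly misrecognized) ICAO call-sign with the most similar
    call-sign currently reported by surveillance, or return None if nothing is
    similar enough. `cutoff` is a hypothetical threshold on difflib's ratio."""
    if predicted is None:
        return None
    matches = difflib.get_close_matches(predicted, airborne_callsigns, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical list of call-signs present in the airspace at the time of the utterance.
airborne = ["DLH123A", "BAW94", "SWR281"]
print(snap_to_surveillance("DLH128A", airborne))  # DLH123A
print(snap_to_surveillance("XYZ999", airborne))   # None
```

Because only a handful of flights are active at once, the candidate list is short, which is why even a crude similarity match can correct many single-character recognition errors.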


“…The description of acoustic ELD based on state-of-the-art x-vectors is given in Section 3. As speech-to-text technology is one of our building blocks, we briefly discuss it in Section 4; we kindly ask the reader to refer to [11] for more information. A description of various language detection systems based on the ASR output is presented in Section 5.…”

Section: Motivation (mentioning; confidence: 99%)

“…A more detailed description of the ASR systems is out of the scope of this paper. The reader is kindly asked to find the details in [11].…”

Section: Speech-to-text (mentioning; confidence: 99%)

Detecting English Speech in the Air Traffic Control Voice Communication

Szöke, Kesiraju, Novotný et al. 2021

Preprint

Self Cite


We launched a community platform for collecting ATC speech worldwide in the ATCO 2 project. Filtering out unseen non-English speech is one of the main components in the data processing pipeline. The proposed English Language Detection (ELD) system is based on embeddings from a Bayesian subspace multinomial model. It is trained on the word confusion network from an ASR system. It is robust, easy to train, and lightweight. We achieved an equal error rate (EER) of 0.0439, a 50% relative reduction compared to the state-of-the-art acoustic ELD system based on x-vectors, in the in-domain scenario. Further, we achieved an EER of 0.1352, a 33% relative reduction compared to the acoustic ELD, in the unseen-language (out-of-domain) condition. We plan to publish the evaluation dataset from the ATCO 2 project.
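The equal error rate (EER) reported above is the operating point at which the miss rate on target trials (English) equals the false-alarm rate on non-target trials (non-English). A minimal sketch of computing it from detector scores follows, assuming higher scores mean "more likely English"; it sweeps the threshold over the observed scores and returns the point where the two error rates are closest, which is a simple approximation. The scores are made up for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    A trial is accepted (e.g. classified as English) when its score is >= threshold.
    Returns the mean of miss and false-alarm rates at the threshold where they are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(miss - false_alarm) < best_gap:
            best_gap, eer = abs(miss - false_alarm), (miss + false_alarm) / 2
    return eer


# Hypothetical detector scores: English (target) vs. non-English (non-target) utterances.
english = [0.9, 0.8, 0.7, 0.3]
non_english = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(english, non_english))  # 0.25
```

EER is threshold-independent in the sense that it summarizes the whole score distribution with one number, which makes it convenient for comparing the text-based and acoustic ELD systems.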
