ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems

Lin, Yi; Yang, Bo; Li, Fulin; Guo, Dongyue; Zhang, Jianwei; Hu, Chen; Zhang, Yi

doi:10.1016/j.asoc.2021.107847

Cited by 17 publications (5 citation statements)

References 46 publications (59 reference statements)

“…In this way, the complicated architecture is supposed to impose training burdens and lead to the gradient vanishing problem. In this work, the residual mechanism is applied to the LSTM layers to improve its trainability, and further to obtain better model convergence and final performance. [Figure 4: The residual BLSTM scheme] In addition, since the ASR is a sequential classification task, the bidirectional mechanism is also performed on the LSTM layers to formulate the BLSTM layer, which benefits to improve the modelling accuracy from the past and future direction simultaneously.…”

Section: Residual LSTMs (mentioning)

confidence: 99%
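
The residual bidirectional layer described in this statement can be written down as a short formula. This is only a sketch: the statement does not say how the forward and backward states are merged or how dimensions are matched, so the concatenation and the projection matrix $W^{(l)}$ below are assumptions rather than the cited paper's definition.

```latex
% Layer l receives the frame sequence x^{(l)}_1, ..., x^{(l)}_T.
\begin{aligned}
\overrightarrow{h}^{(l)}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x^{(l)}_t,\ \overrightarrow{h}^{(l)}_{t-1}\bigr), \\
\overleftarrow{h}^{(l)}_t  &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x^{(l)}_t,\ \overleftarrow{h}^{(l)}_{t+1}\bigr), \\
x^{(l+1)}_t &= x^{(l)}_t + W^{(l)}\bigl[\overrightarrow{h}^{(l)}_t;\ \overleftarrow{h}^{(l)}_t\bigr].
\end{aligned}
```

The identity term $x^{(l)}_t$ gives gradients a direct path through the stack, which is the usual argument for why residual connections ease the vanishing-gradient problem mentioned in the statement.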

“…In the current ATC management system, the ATC is a non-automatic procedure (human-in-the-loop) and is always regarded as a potential risk for the air traffic operation [1]. Numerous studies have demonstrated that monitoring the control conversation is a promising way to obtain real-time traffic dynamics [2,3,4], which benefits to formulate a closed-loop ATC management. To this end, the automatic speech recognition (ASR) technique, with the purpose of building the bridge between the human (ATCO and pilot) and machine (ATC systems), has attracted significant attention worldwide in the ATC domain.…”

Section: Introduction (mentioning)

confidence: 99%

Towards multilingual end‐to‐end speech recognition for air traffic control

Lin, Yang, Guo et al. 2021. IET Intelligent Trans Sys. Self cite.

In this work, an end-to-end framework is proposed to achieve multilingual automatic speech recognition (ASR) in air traffic control (ATC) systems. Considering the standard ATC procedure, a recurrent neural network (RNN) based framework is selected to mine the temporal dependencies among speech frames. Facing the distributed feature space caused by the radio transmission, a hybrid feature embedding block is designed to extract high-level representations, in which multiple convolutional neural networks are designed to accommodate different frequency and temporal resolutions. The residual mechanism is performed on the RNN layers to improve the trainability and the convergence. To integrate the multilingual ASR into a single model and relieve the class imbalance, a special vocabulary is designed to unify the pronunciation of the vocabulary in Chinese and English, i.e., pronunciation-oriented vocabulary. The proposed model is optimized by the connectionist temporal classification loss and is validated on a real-world speech corpus (ATC-Speech). A character error rate of 4.4% and 5.9% is achieved for Chinese and English speech, respectively, which outperforms other popular approaches. Most importantly, the proposed approach achieves the multilingual ASR task in an end-to-end manner with considerably high performance.

INTRODUCTION

Air traffic control (ATC) is an essential service provided by ground-based air traffic controllers (ATCOs) to guide the flight to be operated in a safe manner (i.e. prevent conflict), and further to organize and expedite the traffic flow. As the primary communication way between the ATCO and the aircrew, the spoken instruction through the very high frequency (VHF) radio transmission implies a wealth of contextualized situational information, which is important to the real-time ATC decision-making.

In the current ATC management system, the ATC is a non-automatic procedure (human-in-the-loop) and is always regarded as a potential risk for the air traffic operation [1]. Numerous studies have demonstrated that monitoring the control conversation is a promising way to obtain real-time traffic dynamics [2,3,4], which benefits to formulate a closed-loop ATC management. To this end, the automatic speech recognition (ASR) technique, with the purpose of building the bridge between the human (ATCO and pilot) and machine (ATC systems), has attracted significant attention worldwide in the ATC domain.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
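
The abstract reports results as a character error rate (CER). As a reminder of what that metric measures, here is a minimal sketch that computes CER as the Levenshtein distance between reference and hypothesis characters, divided by the reference length. The function names are illustrative, and ignoring spaces is an assumption; the abstract does not state how the authors tokenize or normalize text before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance over characters / number of reference characters.

    Assumption: spaces are dropped before scoring.
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("air china four four one", "air china four for one"))
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.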

“…Considering that the end-to-end ASR systems are often the most efficient method and deliver competitive quality in recent years [12,22,24,34], a connectionist temporal classification (CTC) based model referring to Deepspeech 2 [35] is introduced to serve as the AM in this work. In general, the AM model consists of convolutional neural networks (CNN), recurrent neural network (RNN), and fully connected (FC) layers.…”

Section: The Acoustic Model (mentioning)

confidence: 99%
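
The statement above refers to a CTC-based acoustic model. To illustrate how a CTC output is turned into a label sequence, the following sketch applies greedy (best-path) decoding: take the highest-scoring label per frame, merge consecutive repeats, then drop the blank label. The blank index of 0, the toy vocabulary, and the function name are assumptions for illustration; the cited work may use a different decoder, such as beam search with a language model.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_scores, id_to_token, blank=BLANK):
    """Best-path CTC decoding over per-frame label scores.

    frame_scores: list of per-frame score lists, one score per label.
    id_to_token:  mapping from non-blank label index to output token.
    """
    # 1. Pick the highest-scoring label for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal labels keeps both of them.
    return [id_to_token[label] for label in collapsed if label != blank]


if __name__ == "__main__":
    vocab = {1: "a", 2: "b"}  # toy vocabulary
    scores = [
        [0.1, 0.8, 0.1],    # a
        [0.2, 0.7, 0.1],    # a (repeat, merged)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.6, 0.3],    # a (kept, separated by blank)
        [0.1, 0.2, 0.7],    # b
        [0.0, 0.1, 0.9],    # b (repeat, merged)
    ]
    print(ctc_greedy_decode(scores, vocab))
```

The order of steps 2 and 3 matters: removing blanks first would wrongly merge genuinely repeated tokens.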

“…An exploratory benchmark of several advanced ASR models trained on ATC corpus was presented in [10]. Semi-supervised Learning [11] and representation learning [12,13] approaches were also introduced to leverage abundant untranscribed speech data to improve ASR performance in the ATC domain. Furthermore, an ASR and callsign detection challenge of the ATC was held by the Airbus company in 2018 [14].…”

Section: Introduction (mentioning)

confidence: 99%

“…Although significant progress of the ASR performance has been made in the ATC domain [9][10][11][12][13][14][15], recognizing isolated digits of the callsign is a challenging task in the ATC domain due to their widespread usage and ambiguous meanings [8]. For example, an ATC instruction Air China four four one climb maintain eight thousand one hundred meters contains multiple digits, the four four one is a part of the callsign while the eight thousand one refers to the flight level.…”

Section: Introduction (mentioning)

confidence: 99%

A Context-Aware Language Model to Improve the Speech Recognition in Air Traffic Control

et al. 2021. Self cite.

Recognizing isolated digits of the flight callsign is an important and challenging task for automatic speech recognition (ASR) in air traffic control (ATC). Fortunately, the flight callsign is a kind of prior ATC knowledge and is available from dynamic contextual information. In this work, we attempt to utilize this prior knowledge to improve the performance of the callsign identification by integrating it into the language model (LM). The proposed approach is named context-aware language model (CALM), which can be applied for both the ASR decoding and rescoring phase. The proposed model is implemented with an encoder–decoder architecture, in which an extra context encoder is proposed to consider the contextual information. A shared embedding layer is designed to capture the correlations between the ASR text and contextual information. The context attention is introduced to learn discriminative representations to support the decoder module. Finally, the proposed approach is validated with an end-to-end ASR model on a multilingual real-world corpus (ATCSpeech). Experimental results demonstrate that the proposed CALM outperforms other baselines for both the ASR and callsign identification task, and can be practically migrated to a real-time environment.
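
To make the idea of callsigns as prior contextual knowledge concrete, the sketch below separates the callsign digits from the instruction digits in the example quoted in the citation statements, using a list of callsigns assumed to be known from context. This is plain string matching, not the CALM model described in the abstract. The lookup table, function names, and the assumption that a callsign is a three-letter airline designator followed by digits are all illustrative.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Hypothetical lookup from airline designator to spoken telephony name.
AIRLINE_NAMES = {"CCA": "air china"}


def spoken_callsign(callsign):
    """Spell a callsign such as 'CCA441' the way it is spoken, or return None."""
    airline, number = callsign[:3], callsign[3:]
    if airline not in AIRLINE_NAMES or not number.isdigit():
        return None
    return " ".join([AIRLINE_NAMES[airline]] + [DIGIT_WORDS[d] for d in number])


def split_callsign(transcript, active_callsigns):
    """Return (callsign, remaining instruction) if a known callsign opens the transcript."""
    text = transcript.lower().strip()
    # Try longer callsigns first so 'CCA441' wins over a shorter prefix like 'CCA44'.
    for callsign in sorted(active_callsigns, key=len, reverse=True):
        spoken = spoken_callsign(callsign)
        if spoken and (text == spoken or text.startswith(spoken + " ")):
            return callsign, text[len(spoken):].strip()
    return None, text


if __name__ == "__main__":
    utterance = ("Air China four four one climb maintain "
                 "eight thousand one hundred meters")
    print(split_callsign(utterance, ["CCA441"]))
```

Exact matching like this fails as soon as the recognizer makes an error inside the callsign, which is one reason a learned model that conditions on the context is attractive.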

Chinese dialect speech recognition: a comprehensive survey

Li, Mai, Wang et al. 2024. Artif Intell Rev.

As a multi-ethnic country with a large population, China is endowed with diverse dialects, which brings considerable challenges to speech recognition work. In fact, due to geographical location, population migration, and other factors, the research progress and practical application of Chinese dialect speech recognition are currently at different stages. Therefore, exploring the significant regional heterogeneities in specific recognition approaches and effects, dialect corpus, and other resources is of vital importance for Chinese speech recognition work. Based on this, we first start with the regional classification of dialects and analyze the pivotal acoustic characteristics of dialects, including specific vowel and tone patterns. Secondly, we comprehensively summarize the existing dialect phonetic corpus in China, which is of some assistance in exploring the general construction methods of dialect phonetic corpus. Moreover, we expound on the general process of dialect recognition. Several critical dialect recognition approaches are summarized and introduced in detail, especially the hybrid method of Artificial Neural Network (ANN) combined with the Hidden Markov Model (HMM), as well as the End-to-End (E2E) approach. Thirdly, through the in-depth comparison of their principles, merits, disadvantages, and recognition performance for different dialects, the development trends and challenges in dialect recognition in the future are pointed out. Finally, some application examples of dialect speech recognition are collected and discussed.
