ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414123

End-to-End Multi-Channel Transformer for Speech Recognition

Abstract: Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel…
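
The abstract outlines per-channel self-attention (CSA) followed by attention across channels (CCA). As a rough illustration of the cross-channel attention idea, below is a minimal PyTorch sketch in which each channel's encoding queries the concatenated encodings of the other channels. The module name, tensor shapes, and residual/normalization details are illustrative assumptions, not the authors' exact design.

# Hypothetical sketch of cross-channel attention (CCA): each channel's
# encoding attends to the other channels' encodings. Shapes and layer
# choices are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Queries come from one channel; keys/values from the remaining channels.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, d_model) -- per-channel encodings,
        # e.g. the output of channel-wise self-attention (CSA) layers.
        b, c, t, d = x.shape
        outputs = []
        for i in range(c):
            query = x[:, i]  # (b, t, d)
            # Concatenate all other channels along the time axis.
            others = torch.cat([x[:, j] for j in range(c) if j != i], dim=1)
            fused, _ = self.attn(query, others, others)
            outputs.append(self.norm(query + fused))  # residual connection
        return torch.stack(outputs, dim=1)  # (b, channels, time, d_model)

# Example: 2 microphone channels, 100 frames, 64-dim features.
cca = CrossChannelAttention(d_model=64)
x = torch.randn(8, 2, 100, 64)
print(cca(x).shape)  # torch.Size([8, 2, 100, 64])

In this sketch each channel keeps its own time-aligned representation while absorbing spectral and spatial cues from the others; a real system would stack several such layers between the CSA blocks and the decoder.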

Cited by 17 publications (5 citation statements) | References 36 publications (59 reference statements)
“…In their work, they modified LSTM cells to learn the interactions between multiple channels by partitioning the memory cell using predetermined view interaction terms. Similarly, Camgoz et al. (2020a) employed multi-channel transformers for the SLT task, where the architecture learns from multiple channels using a modified Transformer architecture (Chang et al., 2021). Recently, Li and Meng (2022) proposed a Transformer-based multi-channel architecture using the information from the entire frame and skeleton input data for the SLT task.…”
Section: Related Work
confidence: 99%
“…The attention mechanism can be naturally introduced into audio and visual tasks, as well as audio-visual fusion tasks [14], [53]-[56]. However, the transformer architecture applied to AVKWS has yet to be studied.…”
Section: Transformer-based Model
confidence: 99%
“…The self-attention mechanism within the transformer captures the relationships between input and output data and, unlike recurrent networks, supports parallel processing of sequences. Transformers have recently been employed in many applications, including natural language processing and computer vision, to name a few [16], [18], [19]. In this work, we employ transformers within the proposed GNN for the task of identifying and eliminating the noise associated with events generated by DVS.…”
Section: Introduction
confidence: 99%