Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.216

Multimodal and Multiresolution Speech Recognition with Transformers

Abstract: This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We particularly focus on the scene context provided by the visual information to ground the ASR. We extract representations for audio features in the encoder layers of the transformer and fuse video features using an additional crossmodal multihead attention layer. Additionally, we incorporate a multitask training criterion for multiresolution ASR, where we train the model to generate both …
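The abstract describes the fusion mechanism only at a high level. As a rough illustrative sketch (the layer sizes, module names, and the multitask weighting below are assumptions, not details taken from the paper), a Transformer encoder layer augmented with an additional cross-modal multihead attention sublayer over video features could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class CrossModalEncoderLayer(nn.Module):
    """Illustrative encoder layer: self-attention over audio frames, followed by
    an extra cross-modal multihead attention sublayer whose queries come from the
    audio stream and whose keys/values come from the video features."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, audio, video):
        # audio: (batch, T_audio, d_model), video: (batch, T_video, d_model)
        x = self.norm1(audio + self.self_attn(audio, audio, audio)[0])
        x = self.norm2(x + self.cross_attn(x, video, video)[0])  # fuse video scene context
        return self.norm3(x + self.ff(x))

# Hypothetical multitask criterion: a weighted sum of the losses at two output
# resolutions (the exact targets and weighting are not given in the excerpt above).
def multiresolution_loss(loss_fine, loss_coarse, alpha=0.5):
    return alpha * loss_fine + (1.0 - alpha) * loss_coarse
```

The `multiresolution_loss` helper is likewise hypothetical; it only illustrates the idea of combining two decoding resolutions in a single training criterion.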

Cited by 36 publications (18 citation statements). References 24 publications.
“…As stated in previous studies (Vaswani et al., 2017; Paraskevopoulos et al., 2020), the attention function can be described as mapping a query and a set of key/value pairs to an output, where the query, keys, values, and output are vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.…”
Section: Methods
mentioning confidence: 99%
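As a concrete reading of that description, here is a minimal NumPy sketch of scaled dot-product attention, the compatibility function used by Vaswani et al. (2017); the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # output = weighted sum of the values

# Example: 2 queries attending over 3 key/value pairs
Q, K, V = np.random.randn(2, 8), np.random.randn(3, 8), np.random.randn(3, 16)
print(scaled_dot_product_attention(Q, K, V).shape)     # (2, 16)
```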
“…Specifically, cross-modal attention can dynamically adapt the streams from one modality to another and correlate meaningful elements across these two modalities (Peng et al., 2017; Ji et al., 2020). In addition, previous studies (Anderson et al., 2018; Yuan and Peng, 2019; Paraskevopoulos et al., 2020; Xu et al., 2020) have shown that the cross-modal attention mechanism can achieve better performance than the state-of-the-art methods in the multimedia field. Therefore, we develop a model with cross-modal attention to fully explore the correlations between audio and EEG signals, so as to solve the AAD problem in this study.…”
Section: Cross-modal Attention
mentioning confidence: 99%
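As an illustration of this idea (not of the cited models themselves), the sketch below uses PyTorch's `nn.MultiheadAttention` with queries from one stream (here labelled EEG) and keys/values from the other (audio), so each EEG step is re-expressed as a weighted combination of the audio frames most compatible with it; all dimensions are made up:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

eeg = torch.randn(8, 50, d_model)     # (batch, EEG time steps, features)
audio = torch.randn(8, 200, d_model)  # (batch, audio frames, features)

# Queries come from the EEG stream; keys/values come from the audio stream,
# so the audio representation is adapted to (aligned with) the EEG stream.
fused, attn_weights = cross_attn(query=eeg, key=audio, value=audio)
print(fused.shape)          # torch.Size([8, 50, 128])
print(attn_weights.shape)   # torch.Size([8, 50, 200])
```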
“…or boost performance in traditionally unimodal applications (e.g. Machine Translation [3], Speech Recognition [4,5] etc.). Moreover, modern advances in neuroscience and psychology hint that multi-sensory inputs are crucial for cognitive functions [6], even since infancy [7].…”
Section: Introduction
mentioning confidence: 99%
“…Transformers [21] are powerful neural architectures that have lately been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only recently has the attention concept also been applied to beamforming, specifically for speech and noise mask estimation [9,27].…”
Section: Introduction
mentioning confidence: 99%