Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.216

Multimodal and Multiresolution Speech Recognition with Transformers

Abstract: This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We particularly focus on the scene context provided by the visual information to ground the ASR. We extract representations for audio features in the encoder layers of the transformer and fuse video features using an additional crossmodal multihead attention layer. Additionally, we incorporate a multitask training criterion for multiresolution ASR, where we train the model to generate both …
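The abstract describes the fusion mechanism only at a high level. As a rough illustrative sketch (the layer sizes, module names, and the multitask weighting below are assumptions, not details taken from the paper), a Transformer encoder layer augmented with an additional cross-modal multihead attention sublayer over video features could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class CrossModalEncoderLayer(nn.Module):
    """Illustrative encoder layer: self-attention over audio frames, followed by
    an extra cross-modal multihead attention sublayer whose queries come from the
    audio stream and whose keys/values come from the video features."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, audio, video):
        # audio: (batch, T_audio, d_model), video: (batch, T_video, d_model)
        x = self.norm1(audio + self.self_attn(audio, audio, audio)[0])
        x = self.norm2(x + self.cross_attn(x, video, video)[0])  # fuse video scene context
        return self.norm3(x + self.ff(x))

# Hypothetical multitask criterion: a weighted sum of the losses at two output
# resolutions (the exact targets and weighting are not given in the excerpt above).
def multiresolution_loss(loss_fine, loss_coarse, alpha=0.5):
    return alpha * loss_fine + (1.0 - alpha) * loss_coarse
```

The `multiresolution_loss` helper is likewise hypothetical; it only illustrates the idea of combining two decoding resolutions in a single training criterion.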

Cited by 36 publications (18 citation statements). References 24 publications.
“…As stated in previous studies (Vaswani et al., 2017; Paraskevopoulos et al., 2020), the attention function can be described as mapping a query and a set of key/value pairs to an output, where the query, keys, values, and output are vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.…”
Section: Methods
mentioning confidence: 99%
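As a concrete reading of that description, here is a minimal NumPy sketch of scaled dot-product attention, the compatibility function used by Vaswani et al. (2017); the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # output = weighted sum of the values

# Example: 2 queries attending over 3 key/value pairs
Q, K, V = np.random.randn(2, 8), np.random.randn(3, 8), np.random.randn(3, 16)
print(scaled_dot_product_attention(Q, K, V).shape)     # (2, 16)
```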
“…Specifically, cross-modal attention can dynamically adapt the streams from one modality to another and correlate meaningful elements across these two modalities (Peng et al., 2017; Ji et al., 2020). In addition, previous studies (Anderson et al., 2018; Yuan and Peng, 2019; Paraskevopoulos et al., 2020; Xu et al., 2020) have shown that the cross-modal attention mechanism can achieve better performance than the state-of-the-art methods in the multimedia field. Therefore, we develop a model with cross-modal attention to fully explore the correlations between audio and EEG signals, so as to solve the AAD problem in this study.…”
Section: Cross-modal Attention
mentioning confidence: 99%
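As an illustration of this idea (not of the cited models themselves), the sketch below uses PyTorch's `nn.MultiheadAttention` with queries from one stream (here labelled EEG) and keys/values from the other (audio), so each EEG step is re-expressed as a weighted combination of the audio frames most compatible with it; all dimensions are made up:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

eeg = torch.randn(8, 50, d_model)     # (batch, EEG time steps, features)
audio = torch.randn(8, 200, d_model)  # (batch, audio frames, features)

# Queries come from the EEG stream; keys/values come from the audio stream,
# so the audio representation is adapted to (aligned with) the EEG stream.
fused, attn_weights = cross_attn(query=eeg, key=audio, value=audio)
print(fused.shape)          # torch.Size([8, 50, 128])
print(attn_weights.shape)   # torch.Size([8, 50, 200])
```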
“…or boost performance in traditionally unimodal applications (e.g. Machine Translation [3], Speech Recognition [4,5] etc.). Moreover, modern advances in neuroscience and psychology hint that multi-sensory inputs are crucial for cognitive functions [6], even since infancy [7].…”
Section: Introduction
mentioning confidence: 99%
“…Transformers [21] are powerful neural architectures that have lately been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only recently has the attention concept also been applied to beamforming, specifically for speech and noise mask estimation [9,27].…”
Section: Introduction
mentioning confidence: 99%