2020
DOI: 10.3390/app10207263

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Abstract: Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or has replaced the LSTM in the transformer model to overcome the sequence-to-sequence (seq2seq) limitations of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between the audio and visual modalities. Because the audio has richer information than the video related to l…
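To make the idea of dual cross-modality (DCM) attention concrete, below is a minimal PyTorch sketch in which audio features attend over visual features and vice versa; the class, dimension, and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    """Hypothetical sketch of dual cross-modality attention: audio features
    attend to visual features and vice versa, so each stream can borrow
    alignment cues from the other modality."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # batch_first=True -> tensors are (batch, time, d_model)
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Audio queries attend over visual keys/values, and vice versa.
        a2v, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v2a, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        # Residual connections keep each modality's own information.
        return self.norm_a(audio + a2v), self.norm_v(visual + v2a)

# Usage: 80 audio frames and 25 video frames, both projected to d_model=256.
audio = torch.randn(2, 80, 256)
visual = torch.randn(2, 25, 256)
audio_out, visual_out = DualCrossModalityAttention()(audio, visual)
print(audio_out.shape, visual_out.shape)  # (2, 80, 256) (2, 25, 256)
```

Note that the two streams may have different frame rates (as in the 80-versus-25 example above); cross-attention handles this naturally because each output keeps the time length of its query stream.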

Cited by 15 publications (9 citation statements). References: 31 publications.
“…It is important for voice recognition technologies to be of high quality and to enable people to express themselves more accurately. In this paper [7], the authors proposed an AVSR model based on the transformer with the DCM attention and a hybrid CTC/attention architecture. The DCM attention was constructed to obtain proper alignment information between the audio and visual modalities even with noisy, reverberant audio data, and the hybrid CTC/attention structure was applied to enhance monotonic alignments.…”
Section: Literature Survey (mentioning)
confidence: 99%
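The hybrid CTC/attention objective mentioned in this statement is commonly realized as a weighted sum of the two losses. The following is a minimal sketch under that assumption; the function name, tensor shapes, and the weight `lam` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, dec_logits, targets,
                              input_lengths, target_lengths, lam: float = 0.3):
    """Illustrative hybrid CTC/attention objective.
    ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
    dec_logits:    (B, L, V) decoder output logits
    targets:       (B, L) token ids, padded with 0 (0 is also the CTC blank,
                   which never occurs inside a real target sequence)"""
    # CTC term: enforces monotonic alignment between input frames and tokens.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Attention term: cross-entropy on decoder predictions; padding ignored.
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets, ignore_index=0)
    return lam * ctc + (1.0 - lam) * att

# Example shapes: T=50 encoder frames, B=2, V=30 tokens, L=10 target length.
ctc_lp = torch.randn(50, 2, 30).log_softmax(-1)
dec = torch.randn(2, 10, 30)
tgt = torch.randint(1, 30, (2, 10))
loss = hybrid_ctc_attention_loss(ctc_lp, dec, tgt,
                                 input_lengths=torch.full((2,), 50),
                                 target_lengths=torch.full((2,), 10))
```

The CTC branch discourages non-monotonic alignments while the attention branch models label dependencies, which is why the combination is reported to stabilize training.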
“…The field has been rapidly developing since then. Most of the works are devoted to architectural improvements; for example, Zhang et al. (2019) proposed a temporal focal block and spatio-temporal fusion, and Lee et al. (2020) explored the use of cross-modality attentions with the Transformer.…”
Section: Audio-visual Speech Recognition (mentioning)
confidence: 99%
“…In particular, we apply dual cross-modal attention (DCMA) in the decoder part, which, as far as we know, is the first attempt at multi-task learning including the SELD task, although DCMA has been used in multi-modal tasks such as audio-visual speech recognition and audio-text emotion detection [21,22]. Related information between the features for SED and DOAE may be helpful for the SELD task, which needs to predict the class and direction of a specific sound event simultaneously.…”
Section: Our Contributions (mentioning)
confidence: 99%
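As an illustration of how DCMA could couple the two SELD streams described in this statement, here is a hypothetical PyTorch sketch in which the sound event detection (SED) branch and the direction-of-arrival estimation (DOAE) branch attend over each other's features; the module layout, dimensions, and output heads are assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

class SELDDualCrossAttention(nn.Module):
    """Hypothetical sketch: the SED branch queries the DOAE branch and
    vice versa, so each task prediction can use the other's features."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, n_classes: int = 13):
        super().__init__()
        self.sed_from_doa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.doa_from_sed = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sed_head = nn.Linear(d_model, n_classes)      # event activity per class
        self.doa_head = nn.Linear(d_model, 3 * n_classes)  # (x, y, z) per class

    def forward(self, sed_feat: torch.Tensor, doa_feat: torch.Tensor):
        # Each branch attends over the other branch's feature sequence.
        sed_x, _ = self.sed_from_doa(sed_feat, doa_feat, doa_feat)
        doa_x, _ = self.doa_from_sed(doa_feat, sed_feat, sed_feat)
        # Residual fusion, then task-specific activations: sigmoid for
        # event activity, tanh for Cartesian direction vectors.
        sed_out = torch.sigmoid(self.sed_head(sed_feat + sed_x))
        doa_out = torch.tanh(self.doa_head(doa_feat + doa_x))
        return sed_out, doa_out
```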