2020
DOI: 10.3390/app10207263

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Abstract: Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or has replaced the LSTM in the transformer model to overcome the sequence-to-sequence (seq2seq) limitations of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between the audio and visual modalities. Because the audio has richer information than the video related to l…
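To make the idea of dual cross-modality (DCM) attention concrete, below is a minimal PyTorch sketch in which audio features attend over visual features and vice versa; the class, dimension, and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    """Hypothetical sketch of dual cross-modality attention: audio features
    attend to visual features and vice versa, so each stream can borrow
    alignment cues from the other modality."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # batch_first=True -> tensors are (batch, time, d_model)
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Audio queries attend over visual keys/values, and vice versa.
        a2v, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v2a, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        # Residual connections keep each modality's own information.
        return self.norm_a(audio + a2v), self.norm_v(visual + v2a)

# Usage: 80 audio frames and 25 video frames, both projected to d_model=256.
audio = torch.randn(2, 80, 256)
visual = torch.randn(2, 25, 256)
audio_out, visual_out = DualCrossModalityAttention()(audio, visual)
print(audio_out.shape, visual_out.shape)  # (2, 80, 256) (2, 25, 256)
```

Note that the two streams may have different frame rates (as in the 80-versus-25 example above); cross-attention handles this naturally because each output keeps the time length of its query stream.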

Cited by 15 publications (9 citation statements). References: 31 publications.
“…It is important for voice recognition technologies to be of high quality and to enable people to express themselves more accurately. In this paper [7], the authors proposed an AVSR model based on the transformer with the DCM attention and a hybrid CTC/attention architecture. The DCM attention was constructed to obtain proper alignment information between the audio and visual modalities even with noisy, reverberant audio data, and the hybrid CTC/attention structure was applied to enhance monotonic alignments.…”
Section: Literature Survey (mentioning)
confidence: 99%
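The hybrid CTC/attention objective mentioned in this statement is commonly realized as a weighted sum of the two losses. The following is a minimal sketch under that assumption; the function name, tensor shapes, and the weight `lam` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, dec_logits, targets,
                              input_lengths, target_lengths, lam: float = 0.3):
    """Illustrative hybrid CTC/attention objective.
    ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
    dec_logits:    (B, L, V) decoder output logits
    targets:       (B, L) token ids, padded with 0 (0 is also the CTC blank,
                   which never occurs inside a real target sequence)"""
    # CTC term: enforces monotonic alignment between input frames and tokens.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Attention term: cross-entropy on decoder predictions; padding ignored.
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets, ignore_index=0)
    return lam * ctc + (1.0 - lam) * att

# Example shapes: T=50 encoder frames, B=2, V=30 tokens, L=10 target length.
ctc_lp = torch.randn(50, 2, 30).log_softmax(-1)
dec = torch.randn(2, 10, 30)
tgt = torch.randint(1, 30, (2, 10))
loss = hybrid_ctc_attention_loss(ctc_lp, dec, tgt,
                                 input_lengths=torch.full((2,), 50),
                                 target_lengths=torch.full((2,), 10))
```

The CTC branch discourages non-monotonic alignments while the attention branch models label dependencies, which is why the combination is reported to stabilize training.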
“…The field has been rapidly developing since then. Most of the works are devoted to architectural improvements; for example, Zhang et al. (2019) proposed a temporal focal block and spatio-temporal fusion, and Lee et al. (2020) explored the use of cross-modality attentions with the Transformer.…”
Section: Audio-visual Speech Recognition (mentioning)
confidence: 99%
“…In particular, we apply dual cross-modal attention (DCMA) in the decoder part, which, as far as we know, is the first attempt at multi-task learning including the SELD task, although DCMA has been used in multi-modal tasks such as audio-visual speech recognition and audio-text emotion detection [21,22]. Related information between the features for SED and DOAE may be helpful for the SELD task, which needs to predict the class and direction of a specific sound event simultaneously.…”
Section: Our Contributions (mentioning)
confidence: 99%
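As an illustration of how DCMA could couple the two SELD streams described in this statement, here is a hypothetical PyTorch sketch in which the sound event detection (SED) branch and the direction-of-arrival estimation (DOAE) branch attend over each other's features; the module layout, dimensions, and output heads are assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

class SELDDualCrossAttention(nn.Module):
    """Hypothetical sketch: the SED branch queries the DOAE branch and
    vice versa, so each task prediction can use the other's features."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, n_classes: int = 13):
        super().__init__()
        self.sed_from_doa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.doa_from_sed = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sed_head = nn.Linear(d_model, n_classes)      # event activity per class
        self.doa_head = nn.Linear(d_model, 3 * n_classes)  # (x, y, z) per class

    def forward(self, sed_feat: torch.Tensor, doa_feat: torch.Tensor):
        # Each branch attends over the other branch's feature sequence.
        sed_x, _ = self.sed_from_doa(sed_feat, doa_feat, doa_feat)
        doa_x, _ = self.doa_from_sed(doa_feat, sed_feat, sed_feat)
        # Residual fusion, then task-specific activations: sigmoid for
        # event activity, tanh for Cartesian direction vectors.
        sed_out = torch.sigmoid(self.sed_head(sed_feat + sed_x))
        doa_out = torch.tanh(self.doa_head(doa_feat + doa_x))
        return sed_out, doa_out
```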