ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414123

End-to-End Multi-Channel Transformer for Speech Recognition

Abstract: Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel…
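
The abstract outlines per-channel self-attention (CSA) followed by attention across channels (CCA). As a rough illustration of the cross-channel attention idea, below is a minimal PyTorch sketch in which each channel's encoding queries the concatenated encodings of the other channels. The module name, tensor shapes, and residual/normalization details are illustrative assumptions, not the authors' exact design.

# Hypothetical sketch of cross-channel attention (CCA): each channel's
# encoding attends to the other channels' encodings. Shapes and layer
# choices are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Queries come from one channel; keys/values from the remaining channels.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, d_model) -- per-channel encodings,
        # e.g. the output of channel-wise self-attention (CSA) layers.
        b, c, t, d = x.shape
        outputs = []
        for i in range(c):
            query = x[:, i]  # (b, t, d)
            # Concatenate all other channels along the time axis.
            others = torch.cat([x[:, j] for j in range(c) if j != i], dim=1)
            fused, _ = self.attn(query, others, others)
            outputs.append(self.norm(query + fused))  # residual connection
        return torch.stack(outputs, dim=1)  # (b, channels, time, d_model)

# Example: 2 microphone channels, 100 frames, 64-dim features.
cca = CrossChannelAttention(d_model=64)
x = torch.randn(8, 2, 100, 64)
print(cca(x).shape)  # torch.Size([8, 2, 100, 64])

In this sketch each channel keeps its own time-aligned representation while absorbing spectral and spatial cues from the others; a real system would stack several such layers between the CSA blocks and the decoder.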

Cited by 17 publications (5 citation statements) | References 36 publications (59 reference statements)
“…In their work, they modified LSTM cells to learn the interactions between multiple channels by partitioning the memory cell using predetermined view interaction terms. Similarly, Camgoz et al. (2020a) employed multi-channel transformers for the SLT task, where the architecture learns from multiple channels using a modified Transformer architecture (Chang et al., 2021). Recently, Li and Meng (2022) proposed a Transformer-based multi-channel architecture using the information from the entire frame and skeleton input data for the SLT task.…”
Section: Related Work
confidence: 99%
“…The attention mechanism can be naturally introduced into audio and visual tasks, as well as audio-visual fusion tasks [14], [53]-[56]. However, the transformer architecture applied to AVKWS has yet to be studied.…”
Section: Transformer-based Model
confidence: 99%
“…The self-attention mechanism within the transformer captures the relationships between input and output data and, unlike recurrent networks, supports parallel processing of sequences. Transformers have recently been employed in many applications, including natural language processing and computer vision, to name a few [16], [18], [19]. In this work, we employ transformers within the proposed GNN for the task of identifying and eliminating the noise associated with events generated by DVS.…”
Section: Introduction
confidence: 99%