Multiresolution and Multimodal Speech Recognition with Transformers

Paraskevopoulos, Georgios; Parthasarathy, S.; Khare, Aparna; Sundaram, Shiva

doi:10.48550/arxiv.2004.14840

Cited by 2 publications

(2 citation statements)

References 0 publications


“…Multi-modal tasks were traditionally associated with visual question answering (Goyal et al, 2017), image captioning (Gurari et al, 2020), audiovisual speech recognition (Paraskevopoulos et al, 2020), or cross-modal retrieval (Wang et al, 2016). With success of competitions like the Hateful Memes Challenge (Kiela et al, 2020), more research focused on multi-modal offensive classification.…”

Section: Introduction (mentioning)

confidence: 99%

UPB at SemEval-2022 Task 5: Enhancing UNITER with Image Sentiment and Graph Convolutional Networks for Multimedia Automatic Misogyny Identification

Paraschiv, Dascălu, Cercel

2022

Preprint


In recent times, the detection of hate speech and offensive or abusive language in online media has become an important topic in NLP research due to the exponential growth of social media and the propagation of such messages, as well as their impact. Misogyny detection, even though it plays an important part in hate speech detection, has not received the same attention. In this paper, we describe our classification systems submitted to SemEval-2022 Task 5: MAMI (Multimedia Automatic Misogyny Identification). The shared task aimed to identify misogynous content in a multi-modal setting by analysing meme images together with their textual captions. To this end, we propose two models based on the pre-trained UNITER model: the first is enhanced with an image sentiment classifier, whereas the second leverages a Vocabulary Graph Convolutional Network (VGCN). Additionally, we explore an ensemble using the aforementioned models. Our best model reaches an F1-score of 71.4% in Sub-task A and 67.3% in Sub-task B, positioning our team in the upper third of the leaderboard. We release the code and experiments for our models on GitHub.

“…Transformers [21] are powerful neural architectures that lately have been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only until recently, the attention concept has also been applied to beamforming, specifically for speech and noise mask estimations [9,27].…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Multi-Channel Transformer for Speech Recognition

Chang, Radfar, Mouchtaris, et al.

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship "within" and "between" channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that, on a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.
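
The "within" and "between" channel attention described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits learned projections, multiple heads, and the EDA decoder, and it assumes that cross-channel attention takes queries from one channel and keys and values from the frames of all other channels. The function names (`attention`, `channel_self_attention`, `cross_channel_attention`) and the toy input are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def channel_self_attention(channels):
    # "Within" a channel: each channel attends over its own time frames.
    return [attention(x, x, x) for x in channels]

def cross_channel_attention(channels):
    # "Between" channels (assumption): queries come from channel i,
    # keys and values are the frames of every other channel.
    out = []
    for i, x in enumerate(channels):
        others = [frame for j, ch in enumerate(channels) if j != i
                  for frame in ch]
        out.append(attention(x, others, others))
    return out

# Toy input: 2 channels, 3 time frames each, 2 features per frame.
channels = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
]
csa = channel_self_attention(channels)
cca = cross_channel_attention(channels)
```

Both functions return one output frame per input frame, so `csa` and `cca` keep the shape of `channels`. Each output frame is a weighted average of the attended frames, which is the sense in which a channel's representation absorbs context from its own frames (CSA) or from the other microphones (CCA).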
