A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition

Ivanko, Denis; Ryumin, Dmitry; Karpov, Alexey

doi:10.3390/math11122665

Cited by 7 publications

(6 citation statements)

References 145 publications

(227 reference statements)

“…Personal data processing means any operation that is performed on personal data, including the collection, recording, structuring, storage, adaptation or alteration, retrieval, use, disclosure by transmission, dissemination, combination and erasure (see Art. 4(2) in the GDPR). The entity that, alone or jointly with others, determines the purposes and means of the processing of personal data, is the so-called data controller, whereas the entity that processes personal data on behalf of the controller is the data processor (see Art.…”

Section: Personal Data Protection-legal Provisions (mentioning)

confidence: 99%

“…Machine learning (ML) and, especially, deep learning algorithms, as a type of artificial intelligence (AI) algorithms, are widely used in various fields, such as digital image processing [1], data analytics [2], autonomous systems [3], text and speech recognition [4], face recognition [5], robotics [6], traffic prediction [7], intrusion detection [8], etc. It is still a highly evolving field with a variety of innovative applications [9,10].…”

Section: Introduction (mentioning)

confidence: 99%

Data Protection Issues in Automated Decision-Making Systems Based on Machine Learning: Research Challenges

Christodoulou, Limniotis. 2024. Network.

Data protection issues stemming from the use of machine learning algorithms that are used in automated decision-making systems are discussed in this paper. More precisely, the main challenges in this area are presented, putting emphasis on how important it is to simultaneously ensure the accuracy of the algorithms as well as privacy and personal data protection for the individuals whose data are used for training the corresponding models. In this respect, we also discuss how specific well-known data protection attacks that can be mounted in processes based on such algorithms are associated with a lack of specific legal safeguards; to this end, the General Data Protection Regulation (GDPR) is used as the basis for our evaluation. In relation to these attacks, some important privacy-enhancing techniques in this field are also surveyed. Moreover, focusing explicitly on deep learning algorithms as a type of machine learning algorithm, we further elaborate on one such privacy-enhancing technique, namely, the application of differential privacy to the training dataset. In this respect, we present, through an extensive set of experiments, the main difficulties that occur if one needs to demonstrate that such a privacy-enhancing technique is, indeed, sufficient to mitigate all the risks for the fundamental rights of individuals. More precisely, although we manage—by the proper configuration of several algorithms’ parameters—to achieve accuracy at about 90% for specific privacy thresholds, it becomes evident that even these values for accuracy and privacy may be unacceptable if a deep learning algorithm is to be used for making decisions concerning individuals. The paper concludes with a discussion of the current challenges and future steps, both from a legal as well as from a technical perspective.

“…In this subsection, we evaluate recent progress in VSR methodology. More detailed research on these issues can be found in related studies [68,69].…”

Section: State-of-the-art Approaches For Lip-reading (mentioning)

confidence: 99%

“…We have reviewed the 28 existing benchmarking corpora for VSR and Audio-Visual Speech Recognition (AVSR) suitable for training DL models in our related study [68]. However, all these existing publicly available corpora (both recorded in laboratory conditions and collected from the internet) contain speech recordings only with neutral emotions, or with a minimal amount of emotions/data not labeled with emotion classes.…”

Section: Research Corpora (mentioning)

confidence: 99%

EMOLIPS: Towards Reliable Emotional Speech Lip-Reading

Ryumin, Ryumina, Ivanko. 2023. Mathematics. Self-citation.

In this article, we present a novel approach for emotional speech lip-reading (EMOLIPS). This two-level approach to emotional speech to text recognition based on visual data processing is motivated by human perception and the recent developments in multimodal deep learning. The proposed approach uses visual speech data to determine the type of speech emotion. The speech data are then processed using one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. We implemented these models as a combination of EMO-3DCNN-GRU architecture for emotion recognition and 3DCNN-BiLSTM architecture for automatic lip-reading. We evaluated the models on the CREMA-D and RAVDESS emotional speech corpora. In addition, this article provides a detailed review of recent advances in automated lip-reading and emotion recognition that have been developed over the last 5 years (2018–2023). In comparison to existing research, we mainly focus on the valuable progress brought with the introduction of deep learning to the field and skip the description of traditional approaches. The EMOLIPS approach significantly improves the state-of-the-art accuracy for phrase recognition due to considering emotional features of the pronounced audio-visual speech up to 91.9% and 90.9% for RAVDESS and CREMA-D, respectively. Moreover, we present an extensive experimental investigation that demonstrates how different emotions (happiness, anger, disgust, fear, sadness, and neutral), valence (positive, neutral, and negative) and binary (emotional and neutral) affect automatic lip-reading.

“…The face regions were detected using the popular FaceMesh model [49] from the MediaPipe [50] open source library. We resized the face region images to 224×224 and applied channel normalization.…”

Section: Visual-based Affective States Recognition (mentioning)

confidence: 99%
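The preprocessing step quoted above (resize the face region to 224×224, then normalize per channel) can be sketched roughly as follows. This is an illustration only, not the citing authors' code: the quote does not say which interpolation or which normalization statistics were used, so the nearest-neighbour resize and the ImageNet mean and standard deviation below are assumptions, and the names `resize_nearest`, `normalize_channels`, and `preprocess_face` are hypothetical. Face detection with FaceMesh is left out; the input is assumed to be an already cropped RGB face region.

```python
import numpy as np

# Assumption: ImageNet channel statistics, a common default for pretrained
# vision backbones. The quoted passage does not state which values were used.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_nearest(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize an H x W x C image to size x size with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows[:, None], cols]


def normalize_channels(img: np.ndarray, mean=IMAGENET_MEAN, std=IMAGENET_STD) -> np.ndarray:
    """Scale uint8 pixels to [0, 1], then subtract the mean and divide by the std per channel."""
    x = img.astype(np.float32) / 255.0
    mean_arr = np.asarray(mean, dtype=np.float32)
    std_arr = np.asarray(std, dtype=np.float32)
    return (x - mean_arr) / std_arr


def preprocess_face(face_crop: np.ndarray, size: int = 224) -> np.ndarray:
    """Hypothetical helper: resize a cropped RGB face region and normalize it."""
    return normalize_channels(resize_nearest(face_crop, size))


if __name__ == "__main__":
    # Stand-in for a detected face region (random pixels, 300 x 200 RGB).
    crop = np.random.randint(0, 256, size=(300, 200, 3), dtype=np.uint8)
    print(preprocess_face(crop).shape)
```

In practice a bilinear or bicubic resize from an image library would likely replace the nearest-neighbour sampling used here to keep the sketch dependency-free.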

Multi-Corpus Learning for Audio–Visual Emotions and Sentiment Recognition

2023. Self-citation.

Recognition of emotions and sentiment (affective states) from human audio–visual information is widely used in healthcare, education, entertainment, and other fields; therefore, it has become a highly active research area. The large variety of corpora with heterogeneous data available for the development of single-corpus approaches for recognition of affective states may lead to approaches trained on one corpus being less effective on another. In this article, we propose a multi-corpus learned audio–visual approach for emotion and sentiment recognition. It is based on the extraction of mid-level features at the segment level using two multi-corpus temporal models (a pretrained transformer with GRU layers for the audio modality and pre-trained 3D CNN with BiLSTM-Former for the video modality) and on predicting affective states using two single-corpus cross-modal gated self-attention fusion (CMGSAF) models. The proposed approach was tested on the RAMAS and CMU-MOSEI corpora. To date, our approach has outperformed state-of-the-art audio–visual approaches for emotion recognition by 18.2% (78.1% vs. 59.9%) for the CMU-MOSEI corpus in terms of the Weighted Accuracy and by 0.7% (82.8% vs. 82.1%) for the RAMAS corpus in terms of the Unweighted Average Recall.
