Proceedings of the 22nd ACM International Conference on Multimedia 2014
DOI: 10.1145/2647868.2655026
Discriminating Native from Non-Native Speech Using Fusion of Visual Cues

Abstract: The task of classifying accent, as belonging to a native language speaker or a foreign language speaker, has so far been addressed by means of the audio modality only. However, features extracted from the visual modality have been successfully used to extend or substitute audio-only approaches developed for speech or language recognition. This paper presents a fully automated approach to discriminating native from non-native speech in English, based exclusively on visual appearance features from speech. Long S…

Cited by 4 publications (3 citation statements)
References 10 publications
“…As a matter of fact, the latter have been shown to outperform uni-modal frameworks in various related tasks such as continuous interest prediction [40,16], detection of behavioral mimicry [41], and dimensional and continuous affect prediction [39], to mention but a few. Notably, other challenging problems such as accent classification [42,43,44] and pain intensity estimation [45] have been addressed based exclusively on visual features.…”
Section: Features
confidence: 99%
“…LSTMs [69] constitute an extension of the traditional Recurrent Neural Network architecture that is efficient in capturing contextual statistical regularities with large and unknown lags in time-series data. LSTMs have been successfully applied to various behavioral and affective computing tasks such as continuous and dimensional affect prediction [70,39], visual-only accent classification [43], and audio-visual depression scale prediction [71]. Herein, we use bi-directional LSTMs with 1 hidden layer of 128 memory blocks.…”
Section: Accepted Manuscript
confidence: 99%
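The excerpt above describes the architecture used for visual-only accent classification: a bi-directional LSTM with one hidden layer of 128 memory blocks, which reads the frame sequence forwards and backwards and combines both passes. A minimal numpy sketch of such an encoder is given below; the input feature dimension, sequence length, and weight initialisation are illustrative assumptions, not the cited authors' implementation, which would also include a trained classifier on top.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step. Gates (input, forget, output, candidate)
    # are stacked row-wise in W (4H x D), U (4H x H), and b (4H,).
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    g = np.tanh(z[3 * H:])      # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(seq, W, U, b, H):
    # Run one direction over the whole sequence; return final hidden state.
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

def bilstm_encode(seq, params_fwd, params_bwd, H):
    # Forward pass over the sequence, backward pass over its reversal,
    # concatenated into a single 2H-dimensional utterance encoding.
    h_f = run_lstm(seq, *params_fwd, H)
    h_b = run_lstm(seq[::-1], *params_bwd, H)
    return np.concatenate([h_f, h_b])

# Illustrative setup: H = 128 memory blocks as in the excerpt;
# D = 20 visual features per frame and T = 10 frames are assumptions.
rng = np.random.default_rng(0)
H, D, T = 128, 20, 10

def make_params(H, D):
    return (0.1 * rng.standard_normal((4 * H, D)),
            0.1 * rng.standard_normal((4 * H, H)),
            np.zeros(4 * H))

seq = [rng.standard_normal(D) for _ in range(T)]
encoding = bilstm_encode(seq, make_params(H, D), make_params(H, D), H)
```

The resulting 256-dimensional encoding would then feed a classifier (e.g. softmax over accent classes); in practice a framework implementation such as a bidirectional `nn.LSTM` would replace this hand-rolled loop.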
“…To a large extent, these advances have been possible thanks to the construction of powerful systems based on Deep Learning (DL) architectures that have quickly started to replace traditional systems and to the availability of large-scale databases [19,16]. In 120 this way, technological advances in ALR systems have made possible several novel applications such as dictating messages to smartphones in noisy environments [38,39], using visual silent passwords [40, 41,42], discriminating between native and non-native speakers 125 [43,44,45], transcribing and re-dubbing silent films [16,34], synthesizing voice for people with speech disabilities based on their lip movements [46,47,48,49], developing augmented lip views to assist people with hearing impairments [50] or resolving multi-talker si-…”
confidence: 99%