Proceedings of the 22nd ACM International Conference on Multimedia 2014
DOI: 10.1145/2647868.2655026
Discriminating Native from Non-Native Speech Using Fusion of Visual Cues

Abstract: The task of classifying accent, as belonging to a native language speaker or a foreign language speaker, has so far been addressed by means of the audio modality only. However, features extracted from the visual modality have been successfully used to extend or substitute audio-only approaches developed for speech or language recognition. This paper presents a fully automated approach to discriminating native from non-native speech in English, based exclusively on visual appearance features from speech. Long S…

Cited by 4 publications (3 citation statements)
References 10 publications
“…As a matter of fact, the latter have been shown to outperform uni-modal frameworks in various related tasks such as continuous interest prediction [40,16], detection of behavioral mimicry [41], and dimensional and continuous affect prediction [39], to mention but a few. Notably, other challenging problems such as accent classification [42,43,44] and pain intensity estimation [45] have been addressed based exclusively on visual features.…”
Section: Features
confidence: 99%
“…LSTMs [69] constitute an extension of the traditional Recurrent Neural Network architecture that is efficient in capturing contextual statistical regularities with large and unknown lags in time-series data. LSTMs have been successfully applied to various behavioral and affective computing tasks such as continuous and dimensional affect prediction [70,39], visual-only accent classification [43], and audio-visual depression scale prediction [71]. Herein, we use bi-directional LSTMs with 1 hidden layer of 128 memory blocks.…”
Section: Accepted Manuscript
confidence: 99%
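The excerpt above describes the architecture used for visual-only accent classification: a bi-directional LSTM with one hidden layer of 128 memory blocks, which reads the frame sequence forwards and backwards and combines both passes. A minimal numpy sketch of such an encoder is given below; the input feature dimension, sequence length, and weight initialisation are illustrative assumptions, not the cited authors' implementation, which would also include a trained classifier on top.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step. Gates (input, forget, output, candidate)
    # are stacked row-wise in W (4H x D), U (4H x H), and b (4H,).
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    g = np.tanh(z[3 * H:])      # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(seq, W, U, b, H):
    # Run one direction over the whole sequence; return final hidden state.
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

def bilstm_encode(seq, params_fwd, params_bwd, H):
    # Forward pass over the sequence, backward pass over its reversal,
    # concatenated into a single 2H-dimensional utterance encoding.
    h_f = run_lstm(seq, *params_fwd, H)
    h_b = run_lstm(seq[::-1], *params_bwd, H)
    return np.concatenate([h_f, h_b])

# Illustrative setup: H = 128 memory blocks as in the excerpt;
# D = 20 visual features per frame and T = 10 frames are assumptions.
rng = np.random.default_rng(0)
H, D, T = 128, 20, 10

def make_params(H, D):
    return (0.1 * rng.standard_normal((4 * H, D)),
            0.1 * rng.standard_normal((4 * H, H)),
            np.zeros(4 * H))

seq = [rng.standard_normal(D) for _ in range(T)]
encoding = bilstm_encode(seq, make_params(H, D), make_params(H, D), H)
```

The resulting 256-dimensional encoding would then feed a classifier (e.g. softmax over accent classes); in practice a framework implementation such as a bidirectional `nn.LSTM` would replace this hand-rolled loop.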
“…To a large extent, these advances have been possible thanks to the construction of powerful systems based on Deep Learning (DL) architectures that have quickly started to replace traditional systems and to the availability of large-scale databases [19,16]. In 120 this way, technological advances in ALR systems have made possible several novel applications such as dictating messages to smartphones in noisy environments [38,39], using visual silent passwords [40, 41,42], discriminating between native and non-native speakers 125 [43,44,45], transcribing and re-dubbing silent films [16,34], synthesizing voice for people with speech disabilities based on their lip movements [46,47,48,49], developing augmented lip views to assist people with hearing impairments [50] or resolving multi-talker si-…”
confidence: 99%