Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition

Gurbuz, Sabri; Tüfekçi, Zekeriya; Patterson, Eric; Gowdy, J.N.

doi:10.1109/icassp.2001.940796

Cited by 27 publications

(20 citation statements)

References 10 publications

“…Such improvements have been typically demonstrated on databases of small duration, and, in most cases, limited to a very small number of speakers (mostly less than ten, and often single-subject) and to small vocabulary tasks [18], [21]. Common tasks typically include recognition of nonsense words [22], [23], isolated words [19], [24]–[30], connected digits [31], [32], letters [31], or of closed-set sentences [33], mostly in English, but also in French [22], [34], [35], German [36], [37], and Japanese [38], among others. Recently however, significant improvements have also been demonstrated for large vocabulary continuous speech recognition (LVCSR) [39], as well as cases of speech degraded due to speech impairment [40] or Lombard effects [29].…”

Section: Audio-only ASR Visual-only ASR (Automatic Speechreading

mentioning

confidence: 99%

Recent advances in the automatic recognition of audiovisual speech

et al. 2003

Abstract: Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded at both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, however less so for visually challenging environments and large vocabulary tasks.

“…Affine invariant Fourier descriptors have been used for lip reading [10] and for recognition of aircrafts [11]. Usage of affine invariant Fourier descriptors in human posture estimation is a new approach especially to activity recognition.…”

Section: Affine Invariant Fourier Descriptors

mentioning

confidence: 99%

Human Activity Recognition

Labrador, Yejas 2013

This paper presents a system, which is able to recognize 15 different continuous human activities in real-time using a single stationary camera as input. The system can recognize activities such as raising or waving hand(s), sitting down and bending down. The recognition is based on describing activities as a continuous sequence of discrete postures, which are derived from affine invariant descriptors. Using affine invariant descriptors makes our system robust against such differences in camera locations as distance from the object and change in viewing direction, as these differences can be considered to have the effect of near affine transformations as human silhouettes are considered.

“…The researchers then obtained visual features, namely the affine-invariant Fourier descriptors (AIFDs) [21], the DCT, the rotation-corrected DCT (rc-DCT) and the B-Spline template (BST) [19]. Due to their greater sensitivity to lip shape, the appearance-based features, DCT and rc-DCT, demonstrated good performance compared to that obtained using the shape-based features, AIFDs and BST.…”

Section: Introduction

mentioning

confidence: 99%