Interspeech 2018
DOI: 10.21437/interspeech.2018-2359

Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR

Abstract: This experimental study establishes the first audiovisual speech recognition baseline for the TIMIT sentence portion of the AVICAR dataset, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We utilize the forced alignment of the audio modality of AVICAR to get training targets for the convolutional neural network based visual front end. Based on our observation that there is a great a…

Cited by 1 publication (1 citation statement)
References 14 publications
“…The field of visual speech recognition (VSR), or lipreading, has witnessed dramatic breakthroughs recently, primarily due to the paradigm shift from hand-crafted features to deep learning based models [1][2][3][4][5][6][7][8], coupled with the public release of large suitable corpora in a variety of environments [9][10][11][12][13][14][15], as also reviewed in [16,17]. Such models however, while reducing recognition errors compared to previous approaches, are not as efficient to compute and store.…”
Section: Introduction (citation type: mentioning)
confidence: 99%