Interspeech 2016
DOI: 10.21437/interspeech.2016-721

Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

Cited by 22 publications (15 citation statements)
References 10 publications
“…Both voice and visual data were used in the method, and the extracted features were included in the system for classification. In another work, Takashima et al. [24] proposed a new approach for lip reading that combines lip-image and sound features using deep learning. The proposed method was tested on the ATR Japanese speech dataset.…”
Section: Related Work
mentioning
confidence: 99%
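
The excerpt above refers to combining lip-image and sound features with deep learning, which is also the theme of the cited paper's bimodal-trained bottleneck features. The following is only a minimal sketch of that general idea in PyTorch, assuming MFCC-style audio frames and flattened lip-ROI images; all layer sizes, dimensions, and the phoneme-target training head are illustrative assumptions, not the architecture of Takashima et al.

# Hypothetical sketch: encode audio and lip-image frames separately, concatenate them,
# and squeeze the result through a narrow bottleneck whose activations serve as fused
# features for a downstream recognizer.
import torch
import torch.nn as nn

class BimodalBottleneck(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=32 * 32, bottleneck_dim=40, n_phonemes=40):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 256), nn.ReLU())
        self.bottleneck = nn.Linear(256 + 256, bottleneck_dim)   # fused, low-dimensional features
        self.classifier = nn.Linear(bottleneck_dim, n_phonemes)  # trained against a phoneme target

    def forward(self, audio, lip):
        fused = torch.cat([self.audio_enc(audio), self.visual_enc(lip)], dim=-1)
        bottleneck = torch.relu(self.bottleneck(fused))
        return self.classifier(bottleneck), bottleneck  # logits for training, features for the back end

# Example: one frame of 39-dim audio features and a flattened 32x32 lip region.
model = BimodalBottleneck()
logits, features = model(torch.randn(1, 39), torch.randn(1, 32 * 32))

After training, the bottleneck activations (rather than the logits) would serve as the fused features passed to a conventional recognizer.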
“…Recently, in [3], automatic classification with Support Vector Machines (SVM) between 20 CI users and 20 healthy speakers was performed in order to evaluate articulation disorders from acoustic features. For the case of pathological speech detection, CNNs have outperformed classical machine learning methods [4][5][6]. In these studies, the conventional approach is to perform time-frequency analysis by computing spectrograms of the speech signals to feed the CNNs with single-channel inputs.…”
Section: Introduction
mentioning
confidence: 99%
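
The excerpt above describes the conventional single-channel pipeline of feeding log-spectrograms to a CNN. Below is a minimal sketch of that pipeline, assuming a 16 kHz signal and a toy two-class head; the window length, layer sizes, and class count are illustrative assumptions rather than settings from the cited studies.

# Hedged sketch: log-magnitude spectrogram of a speech signal used as a 1-channel
# "image" input to a small CNN classifier.
import numpy as np
from scipy.signal import spectrogram
import torch
import torch.nn as nn

fs = 16000
speech = np.random.randn(fs)                        # stand-in for a 1-second speech signal
f, t, Sxx = spectrogram(speech, fs=fs, nperseg=400, noverlap=240)
log_spec = np.log(Sxx + 1e-10).astype(np.float32)   # (freq_bins, time_frames)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),                                # e.g. pathological vs. healthy speech
)
x = torch.from_numpy(log_spec)[None, None]           # add batch and single-channel dims
logits = cnn(x)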
“…For these reasons, when targeting continuous lip-reading it is convenient to predict smaller structures that approach the minimum distinguishable language units. Recent advances in end-to-end DL architectures have indeed focused on ALR systems that try to predict phonemes [149,51,139,83] or characters [16,25,161,165], instead of full words or pre-defined sentences. For example, Mroueh et al. [83] proposed feed-forward DNNs to predict phonemes using the IBM AV-ASR database, a large-scale non-public AV database.…”
mentioning
confidence: 99%
“…For example, Mroueh et al. [83] proposed feed-forward DNNs to predict phonemes using the IBM AV-ASR database, a large-scale non-public AV database. Other architectures using CNNs and HMMs were presented by Noda et al. [51,139] and by Takashima et al. [149]. They tried to recognize Japanese phonemes using the ATR Japanese corpus [178] and obtained 22.50% WRR, 37.00% WRR, and 51.00% WRR, respectively.…”
mentioning
confidence: 99%
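
The two excerpts above contrast predicting phonemes or characters with predicting whole words. A common way to train such end-to-end character-level models without frame-level alignments is CTC; the sketch below is an illustrative assumption of that setup (the feature dimension, GRU size, alphabet, and sequence lengths are made up), not the architecture of any of the cited works.

# Illustrative sketch: per-frame visual features -> recurrent encoder -> character
# distribution per frame, trained with CTC so no alignment between video frames
# and transcript characters is needed.
import torch
import torch.nn as nn

T, N, C = 75, 1, 28                                # 75 frames, batch of 1, 26 letters + space + CTC blank
frame_features = torch.randn(T, N, 512)            # per-frame features from a visual front end
rnn = nn.GRU(512, 256)                             # time-major input (T, N, feature_dim)
proj = nn.Linear(256, C)

log_probs = proj(rnn(frame_features)[0]).log_softmax(dim=-1)   # (T, N, C)
transcript = torch.randint(1, C, (N, 10))                      # a 10-character target (index 0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, transcript,
                           input_lengths=torch.full((N,), T),
                           target_lengths=torch.full((N,), 10))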