DBN Based Models for Audio-Visual Speech Analysis and Recognition

Ravyse, Ilse; Jiang, Dongmei; Jiang, Xiaoyue; Lv, Guoyun; Hou, Yanze; Sahli, Hichem; Zhao, Ran

doi:10.1007/11922162_3

Cited by 4 publications

(5 citation statements)

References 7 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…DBN has been applied in different areas such as audiovisual, speech, gesture recognition [2], [3], [4], medical diagnosis [5], and stock price forecasting [6].…”

Section: Methodsmentioning

confidence: 99%

See 1 more Smart Citation

Next place prediction by understanding mobility patterns

Dash

Koo

Gomes

et al. 2015

2015 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops)

View full text Add to dashboard Cite

As technology to connect people across the world is advancing, there should be corresponding advancement in taking advantage of data that is generated out of such connection. To that end, next place prediction is an important problem for mobility data. In this paper we propose several models using dynamic Bayesian network (DBN). Idea behind development of these models come from typical daily mobility patterns a user have. Three features (location, day of the week (DoW), and time of the day (ToD)) and their combinations are used to develop these models. Knowing that not all models work well for all situations, we developed three combined models using least entropy, highest probability and ensemble. Extensive performance study is conducted to compare these models over two different mobility data sets: a CDR data and Nokia mobile data which is based on GPS. Results show that least entropy and highest probability DBNs perform the best.

show abstract

“…DBN has been applied in different areas such as audiovisual, speech, gesture recognition [2], [3], [4], medical diagnosis [5], and stock price forecasting [6].…”

Section: Methodsmentioning

confidence: 99%

“…In the literature there are several works that use DBN to model classification tasks including speech recognition [2], [3], gesture recognition [4], medical prognostic [5], forecasting [6] among others. The specialty of this paper is next place prediction.…”

Section: Introductionmentioning

confidence: 99%

Next place prediction by understanding mobility patterns

Dash

Koo

Gomes

et al. 2015

2015 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops)

View full text Add to dashboard Cite

show abstract

“…As a comparison, word recognition results of tri-phone HMM and single stream DBN (SS-DBN) model are given in Table 1 too, where SS-DBN is given in [9], For Multi-stream HMM (implemented as product HMM, with four audio and four video HMM states) and a given SNRs (0dB to 30dB), stream exponent of audio stream a λ is varied from 0 to 1 in step of 0.05, and the value of the stream exponent that maximized the word accuracy is chosen. Audio stream exponents in different SNRs are given in Table 1 …”

Section: Experiments and Evaluationmentioning

confidence: 99%

“…2, which starts with the detection and tracking of speaker's face [9]. Since the mouth is the most important speech organs, Bayesian Tangent Shape Model (BTSM) algorithm is used to label automatically the feature point of speaker's lip [8], for every image, 20 feature points include outer contour and inner contour of the mouth are achieved, which is given in Fig.…”

Section: Audio and Visual Feature Extractionmentioning

confidence: 99%

Multi-Stream Asynchrony Dynamic Bayesian Network Model for Audio-Visual Continuous Speech Recognition

Jiang

Zhao

et al. 2007

2007 14th International Workshop on Systems, Signals and Image Processing and 6th EURASIP Conference Focused on Speech and Imag

Self Cite

View full text Add to dashboard Cite

How best to describe the asynchrony of the speech and lip motion is a key problem of audio-visual speech recognition model. A Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is brought forward for audio-visual speech recognition, and in this model, audio stream and visual stream are synchronous in word node, while between the word nodes, each stream has its own independent phone, phone transition and observation vector node, and word transition probability is determined by audio stream and visual stream together. For each stream, each word is composed of its corresponding phones, and each phone is associated with observation feature (audio feature for audio stream and visual feature for visual stream), with some probability modeled by Gaussian mixed model. Compare with general multi-stream HMM, MS-ADBN model describes the asynchrony of audio stream and visual stream to the word level. The experiment results on continuous digit audio visual database show that: compare with multi-stream HMM, in the mismatch noise environment, an average improvement of 10.07% are obtained for MS-ADBN model.

show abstract

“…Visual feature extraction is given in Fig. 4, which starts with the detection and tracking of the speaker's face (Ravyse et al, 2006), Since the mouth is the most important speech organs, the contour of the lips is obtained through the Bayesian Tangent Shape Model (BTSM) (Zhou et al, 2003), for each image, 20 profile points include outer contour and inner contour of the mouth are obtained, which is given in Fig. 5.…”

Section: Audio and Visual Feature Extractionmentioning

confidence: 99%