Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Nock, Harriet J.; Iyengar, G.; Neti, C.

doi:10.1007/3-540-45113-7_48

Cited by 63 publications

(73 citation statements)

References 9 publications

“…Then the window is classified as a human, and the HOG features [16] will be extracted and tested with the PL-SVM of the next stage to finally decide whether it is a human being or not.…”

Section: G Two Stage Pl-svm Classifications (mentioning)

confidence: 99%

A Study of Real Time Human Detection Data System from Dynamic Visuals through QEVR and PL-SVM Techniques

Aurobind

2018

IJRASET

Human detection in images is an important research area in the computer vision and pattern recognition community, mainly because of applications such as driving assistance systems, content-based image retrieval, and video surveillance. Digital image processing mainly deals with changing the nature of the image as required. This paper presents a video scene retrieval algorithm based on query by example video retrieval (QEVR). Prior work on video analysis is based on extracting visual features; here, a special framework is used for video retrieval in two stages using query by example, which uses both the audio and the video. The dynamic video is separated into audio and visuals by query by example. The visuals are then given as input to the PL-SVM technique to detect humans. The proposed approach delivers good performance in image retrieval.

“…We use the 22 clips from the groups set in which two speakers take turns reading digit strings and then proceed to speak simultaneously. In order to compare to [4] and [13] we only consider the section of alternating speech. In each clip both individuals face the camera at all times.…”

Section: Audio Visual Experiments (mentioning)

confidence: 99%

“…To the best of our knowledge these results are equivalent to or better than all other reported results for speaker labeling on the CUAVE group set. Nock and Iyengar [4] obtain 75% accuracy with a windowed Gaussian MI measure and Gurban and Thiran [13] get 87.4% with a trained audio-visual speech detector. Both methods use a silence/speech detector and only perform a dependence test when there is speech.…”

Section: Audio Visual Experiments (mentioning)

confidence: 99%

“…This model allows us to take advantage of both structural and parametric changes associated with changes in speaker. This is contrasted with standard sliding window based dependence analysis [1,2,3,4].…”

Section: Introduction (mentioning)

confidence: 99%

“…Specific to audio-visual association, Hershey and Movellan showed how measuring correlation between audio and pixels can help in detecting who is speaking [1]. Nock and Iyengar [4] provided an empirical study of this technique on the CUAVE dataset [10]. Further study of detecting and characterizing the dependency between audio and video was carried out by Slaney and Covell [2] and Fisher, et al [3].…”

Section: Introduction (mentioning)

confidence: 99%

Dynamic Dependency Tests for Audio-Visual Speaker Association

Siracusa, Fisher

2007

2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07

We formulate the problem of audio-visual speaker association as a dynamic dependency test. That is, given an audio stream and multiple video streams, we wish to determine their dependency structure as it evolves over time. To this end, we propose the use of a hidden factorization Markov model in which the hidden state encodes a finite number of possible dependency structures. Each dependency structure has an explicit semantic meaning, namely "who is speaking." This model takes advantage of both structural and parametric changes associated with changes in speaker. This is contrasted with standard sliding window based dependence analysis. Using this model we obtain state-of-the-art performance on an audio-visual association task without benefit of training data.
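For orientation, the sliding-window dependence analysis that this abstract contrasts against can be sketched as a windowed mutual information (MI) test under a joint-Gaussian assumption, where I(A;V) = ½ log(|Σ_A| |Σ_V| / |Σ_AV|). The sketch below is illustrative only and is not the implementation from any of the cited papers: the function names, the window and hop lengths, the feature shapes, and the small ridge term added to each covariance are all assumptions.

```python
import numpy as np

def gaussian_mi(a, v, ridge=1e-9):
    """Mutual information (nats) between two feature sequences,
    assuming they are jointly Gaussian. a: (T, da), v: (T, dv)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    v = np.asarray(v, dtype=float).reshape(len(v), -1)

    def logdet_cov(x):
        # Sample covariance with a tiny ridge (assumed) to avoid singularity.
        cov = np.atleast_2d(np.cov(x, rowvar=False)) + ridge * np.eye(x.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    joint = np.hstack([a, v])
    return 0.5 * (logdet_cov(a) + logdet_cov(v) - logdet_cov(joint))

def windowed_speaker_labels(audio, videos, win=50, hop=25):
    """For each window, pick the video stream with the highest
    Gaussian MI against the audio stream (hypothetical helper)."""
    labels = []
    for start in range(0, len(audio) - win + 1, hop):
        sl = slice(start, start + win)
        scores = [gaussian_mi(audio[sl], vid[sl]) for vid in videos]
        labels.append(int(np.argmax(scores)))
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(200, 1))
    face_a = audio + 0.1 * rng.normal(size=(200, 1))  # synthetic, correlated with audio
    face_b = rng.normal(size=(200, 1))                # synthetic, independent of audio
    print(windowed_speaker_labels(audio, [face_a, face_b]))
```

Each window is scored independently here, so the labels can flip from window to window; a model with hidden state, such as the one the abstract proposes, instead ties the decisions together over time.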