Academic Center for Computing and Media Studies, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
Corresponding author: T. Kawahara, Email: kawahara@i.kyoto-u.ac.jp

I. INTRODUCTION

Multi-modal signal and information processing has been investigated primarily for intelligent human-machine interfaces, including smartphones, kiosk terminals, and humanoid robots. Meanwhile, speech and image processing technologies have improved so much that their targets now include natural human-human behaviors, which occur without the participants being aware of interface devices. In this scenario, sensing devices are installed in an ambient manner. Examples of this direction include meeting capture [1] and conversation analysis [2].

We have been conducting a project that focuses on conversations in poster sessions, hereafter referred to as poster conversations [3, 4]. Poster sessions have become the norm in many academic conventions and open laboratories because of their flexible and interactive nature. In most cases, however, paper posters are still used, even in ICT areas. In some cases, digital devices such as LCDs and PC projectors are used, but they are not equipped with sensing devices. Currently, many lectures at academic events are recorded and distributed via the Internet, but poster sessions are hardly ever recorded.

Poster conversations combine characteristics of lectures and meetings: typically, a presenter explains his/her work to a small audience using a poster, and the audience gives feedback in real time by nodding and verbal backchannels, and occasionally asks questions and makes comments. The conversations are interactive and also multi-modal because, unlike in meetings, participants are standing and moving.
Another advantage of poster conversations is that we can easily set up data collection that is controlled in terms of familiarity with the topics and the other participants and yet is "natural and real".

The goal of this study is signal-level sensing and high-level analysis of human interactions. Specific tasks include face detection, eye-gaze detection, speech separation, and speaker diarization. These will realize a new indexing scheme for poster session archives. For example, after a long poster presentation, we often want a short review of the questions, answers, and feedback from the audience. We also investigate high-level indexing of which segments were attractive and/or difficult for the audience to follow. This will be useful in speech archives because people would be interested in listening to the points other people liked. However, estimating the interest and comprehension levels is apparently difficult and largely subjective. Therefore, we turn to speech acts that are observable and presumably related to these mental states: one is prominent reactive tokens signaled by the audience, and the other is questions raised by them. Prediction of these speech acts from multi-modal behaviors is expected to approximate the estimation of interest and comprehension levels.