Academic Center for Computing and Media Studies, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
Corresponding author: T. Kawahara, Email: kawahara@i.kyoto-u.ac.jp

I. INTRODUCTION

Multi-modal signal and information processing has been investigated primarily for intelligent human-machine interfaces, including smartphones, kiosk terminals, and humanoid robots. Meanwhile, speech and image processing technologies have improved so much that their targets now include natural human-human behaviors, which occur without the participants being aware of interface devices. In this scenario, sensing devices are installed in an ambient manner. Examples of this direction include meeting capture [1] and conversation analysis [2].

We have been conducting a project that focuses on conversations in poster sessions, hereafter referred to as poster conversations [3, 4]. Poster sessions have become the norm in many academic conventions and open laboratories because of their flexible and interactive nature. In most cases, however, paper posters are still used, even in ICT areas. In some cases, digital devices such as LCDs and PC projectors are used, but they are not equipped with sensing devices. Currently, many lectures at academic events are recorded and distributed via the Internet, but poster sessions are hardly ever recorded.

Poster conversations combine characteristics of lectures and meetings: typically, a presenter explains his/her work to a small audience using a poster, and the audience gives feedback in real time by nodding and verbal backchannels, and occasionally asks questions and makes comments. The conversations are interactive and also multi-modal because, unlike in meetings, participants are standing and moving.
Another advantage of poster conversations is that we can easily set up data collection that is controlled in terms of familiarity with the topics and the other participants and yet is "natural and real".

The goal of this study is signal-level sensing and high-level analysis of human interactions. Specific tasks include face detection, eye-gaze detection, speech separation, and speaker diarization. These will realize a new indexing scheme for poster session archives. For example, after a long poster presentation, we often want a short review of the questions, answers, and feedback from the audience. We also investigate high-level indexing of which segments were attractive and/or difficult for the audience to follow. This will be useful in speech archives because people would be interested in listening to the points other people liked. However, estimating the interest and comprehension levels is apparently difficult and largely subjective. Therefore, we turn to speech acts that are observable and presumably related to these mental states: one is prominent reactive tokens signaled by the audience, and the other is questions raised by them. Prediction of these speech acts from multi-modal behaviors is expected to approximate the estimation of interest and comprehension levels.