At FXPAL Japan we have built an experimental Smart Conference Room (SCR) that contains multiple cameras, microphones, displays, and capture devices. Based on our experience, in this paper we discuss research and open issues in constructing SCRs like the one built at FXPAL for the purpose of automatic content analysis. Our discussion is grounded in a novel conceptual meeting model that consists of physical (from layout to cameras), conceptual (meeting types, actors), sensory (audio-visual capture), and content (syntax and semantics) components. We also discuss storage, retrieval, and deployment issues.
INTRODUCTION

Meetings are important events in any organization, and recently there has been renewed interest in building smart meeting rooms that capture meetings on video for future viewing. This is due to lower computer and video equipment costs, higher computational power, and the growing importance of keeping accurate records in companies (for knowledge, risk management, and compliance, among others). In the United States, for example, the SOX act [21] and other recent laws require accurate record keeping to ensure that the financial data the CEO and CFO sign off on is auditable. Although recording meetings is not a requirement, meeting videos may well play an important role in the future: traditional note-taking is insufficient to store all relevant meeting events, and it is subjective, often incomplete, and inaccurate.

Many smart meeting room environments [39][61][43] and portable meeting systems [38] have been developed. Most of the focus has been on developing techniques to automatically process the generated audiovisual content (e.g., face detection and action recognition [67]; speech recognition for topic detection [62]; and many others [3]). However, little attention has been given to the overall meeting capture framework, the issues around building the infrastructure necessary to deploy a real-world application, and the impact of such infrastructure on the development of automatic content analysis techniques.

In this paper, we propose a multiple-component conceptual meeting model and give an overview of the major research issues in building and deploying a smart conference room environment from the perspective of automatic content analysis. We discuss issues ranging from physical room layout and hardware infrastructure to automatic content analysis and metadata. Our model (Figure 1) consists of four components: physical structure, conceptual structure, sensory acquisition, and acquired content¹.
The physical component models the objects and layout of a smart meeting room (e.g., tables). The conceptual component models the structure of the meeting (e.g., meeting type, roles). The sensory component models the capture of the meeting using multiple sensing devices (cameras, microphones, etc.). The content component models the acquired audiovisual content, both its syntax and its semantics. The four components of our model are directly linked by a contextual mesh, which we define as the set of conditions under which the meeting takes place. As the circle in the ce...
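To make the four-component model concrete, the following is a minimal sketch of how it might be represented as data structures. All class and field names here are illustrative assumptions for exposition, not definitions from the model itself; the contextual mesh is approximated as a shared dictionary of conditions linking the components.

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalComponent:
    """Objects and layout of the room (tables, displays, device positions)."""
    objects: list[str] = field(default_factory=list)
    layout: dict[str, tuple[float, float]] = field(default_factory=dict)  # object -> (x, y)

@dataclass
class ConceptualComponent:
    """Structure of the meeting itself: meeting type and participant roles."""
    meeting_type: str = "presentation"
    roles: dict[str, str] = field(default_factory=dict)  # participant -> role

@dataclass
class SensoryComponent:
    """Devices capturing the meeting (cameras, microphones, etc.)."""
    cameras: list[str] = field(default_factory=list)
    microphones: list[str] = field(default_factory=list)

@dataclass
class ContentComponent:
    """Acquired content: low-level syntax and higher-level semantics."""
    syntax: list[str] = field(default_factory=list)     # e.g., shots, speech segments
    semantics: list[str] = field(default_factory=list)  # e.g., topics, actions

@dataclass
class MeetingModel:
    """Four components tied together by a contextual mesh of meeting conditions."""
    physical: PhysicalComponent
    conceptual: ConceptualComponent
    sensory: SensoryComponent
    content: ContentComponent
    context: dict[str, str] = field(default_factory=dict)  # the contextual mesh

# Example instantiation (hypothetical values):
model = MeetingModel(
    physical=PhysicalComponent(objects=["table", "whiteboard"]),
    conceptual=ConceptualComponent(meeting_type="staff", roles={"alice": "chair"}),
    sensory=SensoryComponent(cameras=["pan-tilt-1"], microphones=["ceiling-array"]),
    content=ContentComponent(syntax=["shot boundaries"], semantics=["agenda topics"]),
    context={"schedule": "weekly", "room": "SCR"},
)
```

Separating the components this way lets content analysis modules consume only the parts they need (e.g., a speaker-tracking module reads the physical and sensory components) while the shared context remains available to all of them.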