Starting from a philosophy of integrating the components of multimodal interaction applications with 3D graphical environments, reusing already defined markup languages for describing graphics, and modeling graphical and spoken interaction on the basis of the interactive movie metaphor, we seek a markup language for modeling scenes, behavior and interaction. With the definition of this language, we hope to provide a common framework for developing applications that allow multimodal interaction on 3D stages. To this end, we have defined the basis of an architecture that allows us to integrate the components of such multimodal interaction applications in 3D virtual environments.

Keywords Spoken interaction · Graphical interaction · Human-computer interaction · Multimodality · Dialogue systems · Avatar · Virtual environments · 3D virtual reality · Rich internet applications · Behavior
Motivation and strategy

Introducing multimodal interaction can enrich the user experience (UX), because natural communication includes both speech and gesture. This has been demonstrated in augmented reality (AR) and virtual reality (VR) applications, where spoken interaction gives access to objects outside the user's view simply by naming them, while leaving the user's hands free. Moreover, spoken interaction is already commonplace in mobile and ubiquitous applications: they use speech recognition to analyze the speech signal and produce labels for the recognized words, that is, they use a spoken modality. Another modality used in mobile applications is based on graphical interaction, and it can be combined with speech recognition in several ways, as will be shown in Sect. 3.2.4, depending on how we want the various modalities to cooperate.

A modality is a process that analyzes and produces chunks of information [1], and combining several modalities improves user interaction by making it multimodal. For integrating modalities there is a fundamental W3C recommendation that serves as an architectural framework: the MMI architecture [2], a proposal of the W3C Multimodal Interaction (MMI) Working Group that is introduced in Sect. 3.3.2 (a schematic example of its life-cycle events is sketched at the end of this section). The Multimodal Interaction Activity seeks to extend the Web to allow users to dynamically select the most appropriate mode of interaction for their current needs, including any disabilities, while enabling developers to provide an effective user interface for whichever modes the user selects. Depending upon the device, users will be able to provide input via speech, handwriting, touchscreens and keystrokes, with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms such as mobile phone vibrators and Braille strips.

The main issues about multimodal interaction that are not yet covered are building reliable multimodal systems and usable applications, designing usable adaptive multimodal interfaces, and improving the tools for creating multimodal applications and interfaces so that they can become more mainstream [3]. The challenges of 3D interfaces and VR/AR that are relevant for multimodal interaction remain unsolved, and they are related to the integration of V...
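As a concrete illustration of the integration mechanism that the MMI architecture [2] standardizes, the following minimal sketch shows two of its life-cycle events: the interaction manager asks a speech modality component to start recognition, and the component reports the recognized result wrapped in EMMA. The source and target identifiers, the context and request identifiers, and the recognized command ("open door") are illustrative placeholders, not values prescribed by the recommendation.

<!-- Interaction manager -> speech modality component: start recognition -->
<mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
  <mmi:startRequest source="IM-1" target="speechMC-1"
                    context="ctx-1" requestID="req-1"/>
</mmi:mmi>

<!-- Speech modality component -> interaction manager: recognized utterance -->
<mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
  <mmi:doneNotification source="speechMC-1" target="IM-1"
                        context="ctx-1" requestID="req-1" status="success">
    <mmi:data>
      <emma:emma xmlns:emma="http://www.w3.org/2003/04/emma" version="1.0">
        <!-- One interpretation of the utterance, with its confidence score -->
        <emma:interpretation id="int-1" emma:medium="acoustic"
                             emma:mode="voice" emma:confidence="0.9">
          <command>open door</command>
        </emma:interpretation>
      </emma:emma>
    </mmi:data>
  </mmi:doneNotification>
</mmi:mmi>

Because every modality component exchanges this same event vocabulary with the interaction manager, a graphical component rendering the 3D stage can be coordinated with the speech component in exactly the same way; this loose coupling is the property that the architecture outlined in this paper builds on.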