Augmented reality (AR) is the mixing of computer-generated stimuli with real-world stimuli. In this paper, we present results from a controlled, empirical study comparing three ways of delivering spatialized audio for AR applications: a speaker array, headphones, and a bone-conduction headset. Analogous to optical see-through AR in the visual domain, Hear-Through AR allows users to receive computer-generated audio through the bone-conduction headset and real-world audio through their unoccluded ears. Our results show that subjects localized stationary sounds most accurately using a speaker array physically located around the listener, but that for moving sounds there was no difference in accuracy between the speaker array and the bone-conduction device, and that both devices outperformed standard headphones. Subjective comments collected from subjects after the experiment support these objective results.
INTRODUCTION

Augmented reality (AR) is the mixing of computer-generated stimuli with real-world stimuli. While much work has been done on delivering mixed real-world (RW) and computer-generated (CG) stimuli in the visual domain, we focus here on the audio domain. Recently, we introduced two approaches for audio AR: Hear-Through AR and Mic-Through AR [8]. Our work on Hear-Through AR used either a speaker array or a bone-conduction headset (BCH) to deliver CG audio to the user, while RW sound was received through the unoccluded ear canals. Mic-Through AR captures RW sound using microphones mounted near the user's ears, mixes it with CG sound in the computer, and delivers the resulting AR sound through standard headphones.

In this paper, we present the first results from a formal, empirical study comparing subjects' sound-localization capabilities using both speaker-based and BCH-based Hear-Through AR, as well as Mic-Through AR. To gather baseline data, this study used only simple, well-controlled audio tones played at three frequencies. However, both static and moving tones were considered, while the head of the user was kept stationary.

Lindeman & Noma [7] present a scheme for classifying AR techniques across all human sensory modalities according to where the mixing of CG and RW elements takes place. They underscore the need to correctly match the attributes of CG and RW stimuli so that the user can easily fuse the two, thereby improving the realism of the resulting mixed reality. Two main characteristics differentiate RW and CG audio. First, RW audio is typically of higher fidelity than CG audio. Second, computationally expensive preprocessing is required to subject CG audio to environmental effects matching those of the RW environment.
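As a concrete illustration of the mixing step in Mic-Through AR described above, the following is a minimal sketch, not the implementation used in our system: it combines per-ear microphone frames with CG audio frames in software before headphone playback. The function name, the `cg_gain` balance parameter, and the stereo frame layout are illustrative assumptions, not details from this work.

```python
import numpy as np

def mic_through_mix(rw_frames: np.ndarray, cg_frames: np.ndarray,
                    cg_gain: float = 0.5) -> np.ndarray:
    """Mix real-world (microphone) audio with computer-generated audio.

    rw_frames, cg_frames: float arrays of shape (n_samples, 2), one
    column per ear, with samples normalized to [-1.0, 1.0].
    cg_gain is a hypothetical balance parameter (not from this paper)
    controlling how loud the CG sound is relative to the RW sound.
    """
    mixed = rw_frames + cg_gain * cg_frames
    # Clip to the valid sample range to avoid distortion on playback.
    return np.clip(mixed, -1.0, 1.0)
```

In a real system this mixing would run per audio buffer inside a low-latency callback, since any delay between the RW sound reaching the microphones and reaching the headphones degrades the fusion of CG and RW stimuli.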
Where the mixing of these elements takes place can have a significant impact on this computational cost. CG sound can be displayed using speakers placed within the real environment, allowing RW and CG sounds to mix in the environment before reaching the user. Alternatively, Mic-Through AR using two ...