Although much progress has been made in multimodal interaction, most researchers still treat each modality, such as vision and speech, separately, integrating the results only at the application stage. This is because the roles of the individual modalities and their interactions are not yet fully quantified and precisely understood. However, many issues remain when modalities are combined individually. This paper highlights the main vision problems identified in our review of multimodal applications. It also gives an overview of the Augmented Reality (AR) technologies that contribute to most recent multimodal applications. We cluster vision techniques according to the natural human communication channels, such as face, gesture, and speech, that are frequently used in multimodal applications. The main contribution of this paper is to consolidate some of the main issues and approaches in vision-based techniques and to study AR applications that have been developed within the context of multimodal interaction. We conclude the paper with future directions.