The globalisation of people's interactions in the industrial world and the ecological cost of transport make videoconferencing an attractive solution for collaborative work. However, its lack of immersive perception makes it unappealing to users. The TIFANIS tele-immersion system was conceived to let users interact as if they were physically together. In this paper, we focus on an important feature of the immersive system: the automatic tracking of the user's point of view, so that the scene from the remote site is rendered correctly on the user's display. Viewpoint information has to be computed in a very short time, and the detection system should be non-intrusive; otherwise it would become cumbersome for the user, who would lose the feeling of "being there". The viewpoint detection system consists of several modules. First, an analysis module identifies and follows regions of interest (ROIs) in which faces are detected; we show how spatial detection and temporal tracking cooperate. Second, an eye detector finds the position of the eyes within the faces. Then, the 3D positions of the eyes are deduced from the stereoscopic images of a binocular camera. Finally, the 3D scene is rendered in real time according to the new point of view.
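
To make the third step concrete, the following is a minimal sketch of how a 3D eye position could be recovered from a stereo pair. It is not the method used in the system described here. It assumes an ideal, rectified pinhole stereo rig, in which depth follows from the horizontal disparity as Z = f·B/d. The names `StereoRig`, `triangulate`, and `viewpoint`, as well as all parameters, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StereoRig:
    """Hypothetical rectified stereo rig; the fields are assumptions, not values from the paper."""
    focal_px: float    # focal length, in pixels
    baseline_m: float  # distance between the two optical centres, in metres
    cx: float          # principal point, x coordinate (pixels)
    cy: float          # principal point, y coordinate (pixels)


def triangulate(rig, left_xy, right_xy):
    """Return (X, Y, Z) in metres, in the left-camera frame, for one matched point.

    Assumes rectified images, so the match lies on the same image row and
    only the horizontal disparity carries depth information.
    """
    xl, yl = left_xy
    xr, _ = right_xy
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at infinity or bad match")
    z = rig.focal_px * rig.baseline_m / disparity  # Z = f * B / d
    x = (xl - rig.cx) * z / rig.focal_px           # back-project through the pinhole model
    y = (yl - rig.cy) * z / rig.focal_px
    return (x, y, z)


def viewpoint(rig, first_eye, second_eye):
    """Midpoint of the two triangulated eyes, usable as a rendering viewpoint.

    Each eye is given as a pair (left_image_xy, right_image_xy).
    """
    p = triangulate(rig, *first_eye)
    q = triangulate(rig, *second_eye)
    return tuple((a + b) / 2.0 for a, b in zip(p, q))
```

Under these assumptions, a rig with a focal length of 800 pixels and a baseline of 0.1 m places a point observed with a disparity of 80 pixels at a depth of 1 m. A real system would also need camera calibration, rectification, and some handling of mismatched detections, all of which this sketch omits.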