Having a robust speech recognition system that can be relied upon in different environments is a strong requirement for modern systems. However, audio-only speech recognition still lacks robustness when the signal-to-noise ratio (SNR) decreases. This is especially true when the system is deployed in public spaces or is used for crisis management, where extremely high levels of background noise are to be expected. The video information is not affected by acoustic noise, which makes it an ideal candidate for data fusion.

The acoustic features have been well established over the years, the most widely used being mel-frequency cepstral coefficients (MFCCs) and linear predictive coefficients (LPCs). On the visual side, however, there is still considerable room for improvement, and it is still not clear which visual features retain the most speech-related information. Until now the visual features used have been static features, which describe the speaker's face at a single instant in time only. In [1] the authors showed that most of the techniques used for extracting static visual features yield equivalent features, or at least that their most informative components are equivalent; in other words, all of these techniques describe the same aspect of the visual stream. Moreover, the resulting improvement in recognition, although promising, is still modest. We argue that the main problem of existing methods is that the resulting features contain no information about the motion of the speaker's lips.

In this paper we present a new method, based on optical flow analysis, for extracting features that are useful from the point of view of speech recognition. The video features extracted with this method preserve the information about the motion of the speaker's mouth. We tested the method on an audio-visual database for the Dutch language. The audio-visual speech recognizer (AVSR) used is based on hidden Markov models (HMMs) and was trained for large-vocabulary continuous speech. For completeness we also present the method introduced in [2] for extracting static visual features, and we compare the two methods with respect to the recognition performance they induce. Another way to recover motion information from static features is to use their first and/or second derivatives as visual features; however, this cannot guarantee that the resulting features are physically meaningful quantities. For comparison we also present recognition results based on such features. All of these methods are evaluated under different noise conditions. We show that audio-visual recognition based on the true motion features outperforms the other settings in low-SNR conditions.
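As a concrete illustration of the acoustic front end mentioned above, the sketch below extracts MFCCs from a single utterance. It is a minimal example using the librosa library and assumes a 16 kHz recording with 25 ms analysis frames and a 10 ms hop; the file name and all parameter values are illustrative and are not the configuration used in this work.

```python
import librosa

# Load the waveform at 16 kHz (assumed sample rate).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: 400-sample (25 ms) window, 160-sample (10 ms) hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # shape: (13, T)
```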
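To make the motion-feature idea concrete, the following sketch derives a per-frame motion descriptor from dense optical flow computed over a mouth region. It uses OpenCV's Farnebäck optical flow and summarizes each flow field as a magnitude-weighted orientation histogram; the fixed ROI, the histogram pooling, and all parameter values are assumptions made for illustration and do not reproduce the exact pipeline of the proposed method.

```python
import cv2
import numpy as np

def mouth_motion_features(video_path, roi, bins=8):
    """Per-frame motion descriptors from dense optical flow over a mouth ROI.

    roi = (x, y, w, h) is assumed to come from an external lip tracker;
    a fixed ROI is used here only to keep the sketch self-contained.
    """
    x, y, w, h = roi
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise ValueError("could not read video: %s" % video_path)
    prev = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        # Dense Farneback flow between consecutive mouth crops.
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Magnitude-weighted orientation histogram as a compact motion vector.
        hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi),
                               weights=mag)
        feats.append(hist / (mag.sum() + 1e-8))
        prev = gray
    cap.release()
    return np.array(feats)  # shape: (num_frames - 1, bins)
```

The histogram pooling is only one possible way to turn a flow field into a fixed-length feature vector; it is chosen here because it keeps the descriptor dimension independent of the ROI size.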
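For the derivative-based baseline, delta features are usually computed by linear regression over a short temporal window rather than as a raw frame-to-frame difference. The sketch below shows an HTK-style delta computation over a matrix of static visual features; the window width of 2 is an assumed value.

```python
import numpy as np

def delta(features, width=2):
    """HTK-style regression deltas over time.

    features: (T, D) array of per-frame static visual features.
    Returns a (T, D) array of first-order deltas.
    """
    T, D = features.shape
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = np.zeros((T, D))
    for t in range(T):
        for k in range(1, width + 1):
            # Weighted difference of frames k steps ahead and behind.
            num[t] += k * (padded[t + width + k] - padded[t + width - k])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    return num / denom

# Second-order (acceleration) features are simply deltas of deltas:
# accel = delta(delta(static_feats))
```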