In this paper, we propose a weighted component-based feature descriptor for expression recognition in video sequences. First, we extract texture features and structural shape features from three facial regions of each face image: the mouth, cheeks, and eyes. We then combine the extracted feature sets using a confidence-level strategy. Because different facial components contribute differently to expression recognition, we propose a method that automatically learns a weight for each component via multiple kernel learning. Experimental results on the Extended Cohn-Kanade database show that our approach, which combines the component-based spatiotemporal feature descriptor with the weight learning strategy, achieves better recognition performance than state-of-the-art methods.
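
To make the idea of per-component weighting concrete, the following is a minimal sketch of how one kernel per facial component (mouth, cheeks, eyes) might be merged into a single weighted kernel. It is not the method used in this paper: the weights here come from a simple kernel-target alignment heuristic, swapped in as a stand-in for a full multiple kernel learning solver, and the function names (`rbf_kernel`, `alignment_weights`, `combine_kernels`), the RBF kernel choice, and the `gamma` parameter are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Gram matrix of an RBF kernel over the rows of X (assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point error.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def alignment_weights(kernels, y):
    """Heuristic weights from kernel-target alignment (a stand-in for MKL).

    Each kernel is scored by its normalized Frobenius inner product with an
    ideal kernel that is +1 for same-class pairs and -1 otherwise. Negative
    scores are clipped to zero and the scores are normalized to sum to one.
    """
    y = np.asarray(y)
    ideal = np.where(y[:, None] == y[None, :], 1.0, -1.0)
    scores = np.array([
        max(np.sum(K * ideal) / (np.linalg.norm(K) * np.linalg.norm(ideal)), 0.0)
        for K in kernels
    ])
    if scores.sum() == 0.0:
        # Fall back to uniform weights if no kernel aligns with the labels.
        return np.full(len(kernels), 1.0 / len(kernels))
    return scores / scores.sum()


def combine_kernels(kernels, weights):
    """Weighted sum of component kernels: K = sum_m w_m * K_m."""
    return sum(w * K for w, K in zip(weights, kernels))
```

Under these assumptions, each component's feature matrix would be passed through `rbf_kernel`, the resulting list of Gram matrices scored by `alignment_weights` against the expression labels, and the output of `combine_kernels` handed to any classifier that accepts a precomputed kernel. A component whose kernel agrees more closely with the labels receives a larger weight, which mirrors the intuition that facial components contribute unequally to recognition.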