Emotions change dynamically in response to ever-changing environments. Investigating the neural representation and evoking mechanism of emotion dynamics is of great importance, both clinically and scientifically. However, this line of research remains underexplored, and consistent, conclusive findings are still lacking. In this work, we perform an in-depth investigation of emotion dynamics during a video-watching task by gauging the dynamic associations among evoked emotions, electroencephalography (EEG) responses, and multimedia stimulation. We introduce EEG microstate analysis to study emotional EEG signals, as it provides a spatial-temporal neural representation of emotion dynamics. To characterize the temporal dynamics of emotion evocation during video watching and its underlying neural mechanism, we conduct two studies from the perspective of EEG microstates. In Study 1, dynamic microstate activities under different emotion states and emotion levels are explored to identify EEG spatial-temporal correlates of emotion dynamics. In Study 2, the stimulation effects of multimedia content (visual and audio) on EEG microstate activities are examined to determine the affective information involved and to investigate the emotion-evoking mechanism. The results show that emotion dynamics are well reflected by four EEG microstates (MS1, MS2, MS3, and MS4). Specifically, emotion tasks lead to an increase in MS2 and MS4 coverage but a decrease in MS3 coverage, duration, and occurrence. Meanwhile, there is a negative association between valence and MS4 occurrence, as well as a positive association between arousal and MS3 coverage and occurrence. Further, we find that MS4 and MS3 activities are significantly affected by visual and audio content, respectively.
In this work, we verify the feasibility of revealing emotion dynamics through EEG microstate analysis from both the sensing and stimulation dimensions: EEG microstate features are found to be highly correlated with different emotion states (emotion task and emotion level effects) and with the different types of affective information involved in the multimedia content (visual and audio). Our work deepens the understanding of the neural representation and evoking mechanism of emotion dynamics, which could benefit future developments in emotion decoding and regulation applications.
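The three microstate temporal metrics reported above are standard quantities computed from a per-sample microstate label sequence. The following is a minimal illustrative sketch (not the authors' analysis pipeline): it assumes a segmentation has already assigned one of four labels to each EEG sample, and computes coverage (fraction of time), mean duration (seconds), and occurrence (segments per second) for each microstate; the function name and toy sampling rate are hypothetical.

```python
import numpy as np

def microstate_metrics(labels, sfreq, n_states=4):
    """Compute temporal metrics from a microstate label sequence.

    labels   -- 1-D int array, one label in 0..n_states-1 per EEG sample
    sfreq    -- sampling frequency in Hz
    Returns a dict mapping "MS1".."MS{n_states}" to coverage, duration,
    and occurrence values.
    """
    labels = np.asarray(labels)
    n = len(labels)
    total_time = n / sfreq

    # Boundaries of contiguous runs of identical labels.
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [n]))

    metrics = {}
    for s in range(n_states):
        seg_lens = (ends - starts)[labels[starts] == s]  # run lengths, in samples
        coverage = seg_lens.sum() / n                    # fraction of total time
        duration = seg_lens.mean() / sfreq if seg_lens.size else 0.0  # mean run length (s)
        occurrence = seg_lens.size / total_time          # runs per second
        metrics[f"MS{s + 1}"] = {
            "coverage": coverage,
            "duration": duration,
            "occurrence": occurrence,
        }
    return metrics

# Toy example: a 1-second recording at 10 Hz with four microstates.
toy_labels = np.array([0, 0, 1, 1, 1, 2, 2, 3, 0, 0])
m = microstate_metrics(toy_labels, sfreq=10)
```

In the toy sequence, microstate MS1 (label 0) appears in two runs of two samples each, giving a coverage of 0.4, a mean duration of 0.2 s, and an occurrence of 2 runs per second.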